Re: [net-next-2.6 PATCH] ipoib: remove addrlen check for mc addresses
Eli Cohen wrote:
> Could you send a link to the git tree where I can find this commit and
> the related fixes?

Basically, as the subject line suggests, it should be in Dave's net-next tree.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3 for-2.6.35] ib/iser: fix multipathing over iser, reduce fail-over time
Roland,

This patch series fixes DM multipath over iscsi/iser and reduces the fail-over time; the core patch is #3.

Or.
[PATCH 2/3] ib/iser: remove buggy back-pointer setting
The iscsi connection object life cycle includes binding and unbinding (conn_stop) to/from the iscsi transport connection object. Since iscsi connection objects are recycled, by the time the transport connection (e.g. iser's ib connection) is released, it is illegal to touch the iscsi connection tied to the transport back-pointer, as it may already point to a different transport connection.

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>
---
 drivers/infiniband/ulp/iser/iser_verbs.c |    2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -346,8 +346,6 @@ static void iser_conn_release(struct ise
 	/* on EVENT_ADDR_ERROR there's no device yet for this conn */
 	if (device != NULL)
 		iser_device_try_release(device);
-	if (ib_conn->iser_conn)
-		ib_conn->iser_conn->ib_conn = NULL;
 	iscsi_destroy_endpoint(ib_conn->ep);
 }
[PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
The iser connection teardown flow isn't over until the underlying Connection Manager (e.g. the IB CM) delivers a disconnected or timeout event through the RDMA-CM. When the remote (target) side isn't reachable, e.g. when some HW (port/hca/switch) isn't functioning or is taken down administratively, the CM timeout flow is used and the event may be generated only after a relatively long time, on the order of tens of seconds.

The current iser code exposes this possibly long delay to higher layers, specifically to the iscsid daemon and the iscsi kernel stack. As a result, the iscsi stack doesn't respond well: this low-level CM delay is added to the fail-over time under HA schemes such as the one provided by DM multipath through the multipathd(8) service.

This patch enhances the reference counting scheme on iser's IB connections such that the disconnect flow initiated by iscsid from user space (ep_disconnect) doesn't wait for the CM to deliver the disconnect/timeout event. On the other hand, the connection teardown isn't done from iser's viewpoint until the event is delivered.

The iser ib (rdma) connection object is destroyed when its reference count reaches zero. When this happens in the RDMA-CM callback context, extra care is taken such that the RDMA-CM does the actual destroying of the associated ID, as doing it in the callback is prohibited.

The reference count of the iser ib connection would normally reach three, where the ref, deref relations are
1. conn init, terminate
2. conn bind, stop/destroy
3. cma id create, disconnect/error/timeout callbacks

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>
---
with this patch, multipath fail-over time is about 30 seconds, which is seen here, when a DD over the multi-path device is done before/during/after the fail-over

regularly, before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.926 s, 1.0 GB/s

taking a port down, causing fail-over during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 46.6117 s, 369 MB/s

after path-failure, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6474 s, 1.0 GB/s

13:00:09 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
13:00:24 connection8:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
13:00:24 connection8:0: detected conn error (1011)
13:00:24 iscsid: Kernel reported iSCSI connection 8:0 error (1011) state (3)
13:00:39 cto-1 kernel: device-mapper: multipath: Failing path 8:48.
13:00:39 cto-1 multipathd: 8:48: mark as failed
13:00:39 cto-1 multipathd: mpathd: remaining active paths: 1
-- the disconnected event is delivered after the IB CM timeout expires
-- but fail-over doesn't pend on this
13:01:56 iser: iser_cma_handler:event 10 status 0 conn 88022dcb39b0 id 88022cf09400

without this patch, multipath fail-over time is about 130 seconds

before taking a port down
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.6812 s, 1.0 GB/s

taking a port down during IO
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 145.094 s, 118 MB/s

after fail-over, back to speed
# dd if=/dev/zero of=/dev/dm-0 bs=128k count=128k
17179869184 bytes (17 GB) copied, 16.8935 s, 1.0 GB/s

14:24:05 iser: iser_event_handler:async event 10 on device mlx4_0 port 1
14:24:20 connection4:0: ping timeout of 10 secs expired, recv timeout 5, last rx [...]
14:24:20 kernel: connection4:0: detected conn error (1011)
14:24:21 iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
-- the disconnected event is delivered after the IB CM timeout expires
-- fail-over pending on this
14:25:59 iser: iser_cma_handler:event 10 conn 88022625a1b0 id 880222537c00
14:26:14 session4: session recovery timed out after 15 secs
14:26:14 device-mapper: multipath: Failing path 8:64.
14:26:14 multipathd: mpathd: remaining active paths: 1

 drivers/infiniband/ulp/iser/iscsi_iser.c |    9 ++-
 drivers/infiniband/ulp/iser/iscsi_iser.h |    3 -
 drivers/infiniband/ulp/iser/iser_verbs.c |   72 +--
 3 files changed, 46 insertions(+), 38 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -238,7 +238,7 @@ alloc_err:
  * releases the FMR pool, QP and CMA ID objects, returns 0 on success,
  * -1 on failure
  */
-static int iser_free_ib_conn_res(struct iser_conn *ib_conn)
+static int iser_free_ib_conn_res(struct iser_conn *ib_conn, int can_destroy_id)
 {
 	BUG_ON(ib_conn == NULL);
@@ -253,7 +253,8 @@ static int iser_free_ib_conn_res(struct
 	if (ib_conn->qp != NULL)
 		rdma_destroy_qp(ib_conn->cma_id);
-	if (ib_conn->cma_id
Re: [PATCH/RFC] cxgb4: Add MAINTAINERS info
Roland Dreier wrote:
> +CXGB4 ETHERNET DRIVER (CXGB4)

Not sure who's the butterfly that caused this, but this was somehow committed as

CXGB4 ETHERNET DRIVER (CXGB3)

and the same goes for the IW_ piece.

Or.
Re: [patch v3] infiniband: ulp/iser, fix error retval in iser_create_ib_conn_res
Roland Dreier wrote:
> Or, I don't think we ever fixed this. This patch looks correct to me,
> any problem with merging this for 2.6.35?

Roland, please use V4 below. The patch is okay and applies both before and after the multipathing patches I sent yesterday (the same goes for them).

[PATCH V4] ib/iser: fix error flow in iser_create_ib_conn_res

From: Dan Carpenter <erro...@gmail.com>

We shouldn't free things here because we free them later. The call tree looks like this:

iser_connect() ==> initiating the connection establishment
and later
iser_cma_handler() => iser_route_handler() => iser_create_ib_conn_res()

If we fail here, eventually iser_conn_release() is called, resulting in a double free.

Signed-off-by: Dan Carpenter <erro...@gmail.com>
Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>
---
V1 fixed unreachable code
V2 noticed that the original code had a double free
V3 Roland Dreier points out that I left a dangling ERR_PTR() in ib_conn->fmr_pool which would be freed later on.
V4 reviewed/enhanced the change-log
---
 drivers/infiniband/ulp/iser/iser_verbs.c |   25 +
 1 file changed, 9 insertions(+), 16 deletions(-)

Index: linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
===================================================================
--- linux-2.6.34-rc6.orig/drivers/infiniband/ulp/iser/iser_verbs.c
+++ linux-2.6.34-rc6/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -163,10 +163,8 @@ static int iser_create_ib_conn_res(struc
 	device = ib_conn->device;
 	ib_conn->login_buf = kmalloc(ISER_RX_LOGIN_SIZE, GFP_KERNEL);
-	if (!ib_conn->login_buf) {
-		goto alloc_err;
-		ret = -ENOMEM;
-	}
+	if (!ib_conn->login_buf)
+		goto out_err;
 	ib_conn->login_dma = ib_dma_map_single(ib_conn->device->ib_device,
 				(void *)ib_conn->login_buf, ISER_RX_LOGIN_SIZE,
@@ -175,10 +173,9 @@ static int iser_create_ib_conn_res(struc
 	ib_conn->page_vec = kmalloc(sizeof(struct iser_page_vec) +
 				    (sizeof(u64) * (ISCSI_ISER_SG_TABLESIZE +1)),
 				    GFP_KERNEL);
-	if (!ib_conn->page_vec) {
-		ret = -ENOMEM;
-		goto alloc_err;
-	}
+	if (!ib_conn->page_vec)
+		goto out_err;
+
 	ib_conn->page_vec->pages = (u64 *) (ib_conn->page_vec + 1);
 	params.page_shift        = SHIFT_4K;
@@ -198,7 +195,8 @@ static int iser_create_ib_conn_res(struc
 	ib_conn->fmr_pool = ib_create_fmr_pool(device->pd, &params);
 	if (IS_ERR(ib_conn->fmr_pool)) {
 		ret = PTR_ERR(ib_conn->fmr_pool);
-		goto fmr_pool_err;
+		ib_conn->fmr_pool = NULL;
+		goto out_err;
 	}
 	memset(&init_attr, 0, sizeof init_attr);
@@ -216,7 +214,7 @@ static int iser_create_ib_conn_res(struc
 	ret = rdma_create_qp(ib_conn->cma_id, device->pd, &init_attr);
 	if (ret)
-		goto qp_err;
+		goto out_err;
 	ib_conn->qp = ib_conn->cma_id->qp;
 	iser_err("setting conn %p cma_id %p: fmr_pool %p qp %p\n",
@@ -224,12 +222,7 @@ static int iser_create_ib_conn_res(struc
 		 ib_conn->fmr_pool, ib_conn->cma_id->qp);
 	return ret;
-qp_err:
-	(void)ib_destroy_fmr_pool(ib_conn->fmr_pool);
-fmr_pool_err:
-	kfree(ib_conn->page_vec);
-	kfree(ib_conn->login_buf);
-alloc_err:
+out_err:
 	iser_err("unable to alloc mem or create resource, err %d\n", ret);
 	return ret;
 }
Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
Or Gerlitz <ogerl...@voltaire.com> wrote:
> [...] with this patch, multipath fail-over time is about 30 seconds, which is seen here,
> when a DD over the multi-path device is done before/during/after the fail-over
> [...] without this patch, multipath fail-over time is about 130 seconds

Hi Roland, as we're at -rc7 now, I wanted to check with you whether there's any issue merging this patch series for 2.6.35. If you have any questions or anything needs to be addressed/fixed, I'd like to do that sooner rather than later.

Or
Re: [PATCH 3/3] ib/iser: enhance disconnection logic for multi-pathing
Roland Dreier <rdre...@cisco.com> wrote:
> I have these 3 + Dan Carpenter's fix applied now.

cool

Or.
Re: [ewg] [PATCHv8 02/11] ib_core: IBoE support only QP1
Eli Cohen wrote:
> Roland Dreier wrote:
>> @@ -1007,7 +1010,7 @@ static void ib_sa_add_one(struct ib_device *device)
>> -	sa_dev = kmalloc(sizeof *sa_dev +
>> +	sa_dev = kzalloc(sizeof *sa_dev +
>> Do you happen to remember why you needed these kmalloc -> kzalloc conversions?
> I can't remember why. I do have this habit of preferring kzalloc over kmalloc because it saves trouble sometimes.

Hi Eli, just a friendly comment: it's best if such cleanup is done in a separate patch, else later someone attempting to debug/bisect (who might be yourself, btw) could spend a hell of a time wondering why it was done here and in the framework of this patch...

Or.
Re: [ewg] [PATCHv8 03/11] IB/umad: Enable support only for IB ports
Eli Cohen wrote:
> Roland Dreier wrote:
>> Why do we not allow umad for IBoE ports? I understand there's no QP0 but why can't userspace use QP1 just like for IB link layer ports?
> Currently QP1 is only used by the CM protocol which is implemented in the kernel. Since we handle the iboe specific flow in the cma rather than the SA, there is no need to expose qp1 to userspace.

Eli, is there any reason not to allow reading the HCA/port traffic counters (e.g. with perfquery) with IBoE?

Or.
Re: [ANNOUNCE] librdmacm 1.0.12
Sean Hefty wrote:
> I've pushed out release 1.0.12 of librdmacm.

Hi Sean, below is a tiny patch which will help direct users to the correct mailing list.

set the mailing list info to be linux-rdma instead of the ofa general list

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

diff --git a/configure.in b/configure.in
index d6c4a62..d0f2623 100644
--- a/configure.in
+++ b/configure.in
@@ -1,7 +1,7 @@
 dnl Process this file with autoconf to produce a configure script.
 AC_PREREQ(2.57)
-AC_INIT(librdmacm, 1.0.12, gene...@lists.openfabrics.org)
+AC_INIT(librdmacm, 1.0.12, linux-rdma@vger.kernel.org)
 AC_CONFIG_SRCDIR([src/cma.c])
 AC_CONFIG_AUX_DIR(config)
 AM_CONFIG_HEADER(config.h)
Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Moni Shoua wrote:
> Did you try OFED-1.5.1 or even better, OFED-1.5.2? I know patches for counters with RoCEE were submitted since OFED-1.5 and I saw it working

Moni, I'm not using ofed, sorry... I am interested in a clarification in the context of the upstream submission, e.g. does the problem exist in the latest patch-set, is there a bz case tracking this, etc.

Or.
Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Eli Cohen wrote:
> counter should work as regular in upstream kernel patches for IB link layer.

Okay, good, can you validate that? Basically, I can set aside some time to clone Roland's tree and use the iboe branch as a basis for testing that the IB stack is alive and kicking as it was before the patches. I just need an updated copy of the rest of the patch set (Roland has three patches so far) for that. Over the review process there were a bunch of comments but no new posting; how are you planning to proceed with the review/merge process?

> for IBoE, they will not work since the SMA does not support them
> I have patches that allow to show counters using sysfs

I am not with you. The counters are read using a MAD sent to the firmware PMA (QP #1); this applies to both perfquery and sysfs, doesn't it?

Or.
Re: are IB port counters functioning over --IB-- with the RDMAoE patch set?
Eli Cohen wrote:
> Why are you asking me to validate that? Did you actually encounter a problem with this?

Yes, I did. It didn't work with some ofed drop I was using. Anyway, as I said, I can do some validation that IBoE doesn't break upstream IB; I just need the patches for that, so once they are available, I will give them a try over 2.6.35-rcX.

> The counters patches will divert the code: for iboe it will not issue a MAD to the firmware. It will use another command.

Can you be more specific about the origin of this new design? Is it a HW limitation, a firmware limitation, or something else? In case it's not a hardware limitation, I don't think we need to go for a non-MAD based scheme, at least not for mlx4.

Or.
Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
On Fri, Jun 11, 2010 at 3:47 PM, Chien Tung <chien.tin.t...@intel.com> wrote:
> V2 changes:

What do you consider to be V1, this thread from 2007?

Or.
Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
Walukiewicz, Miroslaw <miroslaw.walukiew...@intel.com> wrote:
> The patch adds a new test application describing a usage of the IBV_QPT_RAW_ETH for IPv4 multicast acceleration on iWARP cards. See man mcraw for parameters description

So this is the only raw qp related patch to librdmacm? Is there any reason not to patch mckey to support both IB and Ethernet raw QPs? Does the raw qp have any relation to the iWARP/TOE HW stack? There's also a raw qp patch posted to ewg for mlx4, which has no backing iwarp logic.

Or
Re: [ewg] [PATCH] node description patch
Mike Heinz wrote:
> This patch fixes a problem with the openibd initialization script. On machines using slower DHCP servers, openibd frequently sets the HCA's node description to HCA-1. This patch modifies openibd to add a @ instead of the hostname and adds a small hook in the core drivers to replace the @ sign with the system's utsname().
> Because this patch depends on changes to openibd, it cannot be submitted to the upstream kernel, but it still corrects an outstanding issue with OFED

Mike,

The fact that your patch touches both user and kernel space code doesn't mean the kernel part can't be submitted upstream. I suggest you repost the patches to linux-rdma as a series of two patches, one to the kernel and one to the service script. The kernel part could then be picked up by the maintainer and will come into play once there's user space code plugging into it. This is similar to cases where people have kernel netlink agent code: merging is not dependent on the existence of specific matching user space code.

As for the user space part, the IB stack provided by the distros does have a service script, and this service script attempts to set the node descriptor, e.g. here's the RHEL6 beta rdma service code

# grep -A 13 "node description" /etc/rc.d/init.d/rdma
    # Add node description to sysfs
    IBSYSDIR="/sys/class/infiniband"
    if [ -d ${IBSYSDIR} ]; then
        declare -i hca_id=1
        for hca in ${IBSYSDIR}/*
        do
            if [ -w ${hca}/node_desc ]; then
                echo -n "$(hostname | cut -f 1 -d .) HCA-${hca_id}" >> ${hca}/node_desc 2> /dev/null
            fi
            let hca_id++
        done
    fi
    errata_58

If you want to patch this, you can open a bugzilla case with the relevant distro and propose a patch.

Or.
Re: FW: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
Walukiewicz, Miroslaw wrote:
> The mckey works on UD_QP type and mcraw works on RAW_QP type. The data payload prepared for UD and RAW_QP are on different layers. The mckey uses rdma_join_multicast() that triggers a state machine for IB multicast joining. The mcraw does not trigger such state machine because for sending the ethernet multicast there is no need for any multicast joining state machine. The multicast destination address on ethernet is determined by multicast group address.

Miroslaw,

I tend not to agree with the entire set of your arguments. To start with, the code issues an IP_ADD_MEMBERSHIP call on a socket and also computes the actual L2 address derived from the L3 multicast address. If these ops are required for raw ethernet multicast operation, you can enhance the rdmacm to have rdma_join_multicast carry out these ops, similarly to what it does for PS_UDP and PS_IPOIB over IB (determine the L2 address, join through the relevant state machine, etc).

It's not that you claim to be able to run raw multicast without any relation to the rdma-cm, e.g. you even want this code to be shipped with librdmacm... so let's understand the architecture, for example what port space is needed here. Looking at the code, it looks like you want it to operate in the same manner as PS_IPOIB, so maybe extend this port space or declare an equivalent one for ethernet.

Or.
Re: [PATCH 5/5] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's
Hefty, Sean wrote:
> The index isn't guaranteed to be the same across all nodes. If a consumer is going to manually control this, they should really be forced to use the actual pkey.

Yes, I saw this confusion in action. For most users the pkey index doesn't mean anything; it may also change over time, which can break scripts/settings that run specific jobs using specific partitions.

Or.
Re: [ewg] [PATCH] pkey fix for ipoib - resubmission
Jason Gunthorpe wrote:
> Be aware that mainline and OFED are different in this regard, OFED overrides the pkey unconditionally for multicast addresses, while mainline doesn't

Can you clarify this, please?

> ipoib bonding had much the same problem with invalid maddrs, and a patch was put in that flushed the maddr table in certain bond scenarios.

Yes, reading through this thread, I tend to agree with Jason that we're in the same boat (problem) as bonding/ipoib used to be, which was fixed in commit 75c78500ddad74b229cd0691496b8549490496a2 "bonding: remap multicast addresses without using dev_close() and dev_open()", so I assume a similar solution can/should be applied here as well, unless someone comes up with a magic approach to eliminate the problem altogether...

Or.
mlx4 pci device table
Hi Yevgeny, Roland

I wonder if you can spare a few words on what would be the correct location of the PCI ID table under the two-tier architecture of the mlx4 driver. If the table is placed in mlx4_core (as it is today upstream), then I assume mlx4_en and _ib aren't being probed by the PCI hot-plug mechanisms, correct? Otherwise, if you put it in the _en, _ib et al files, then one has to maintain two copies of the table, but maybe this would be the correct approach? How should this work with multi-protocol mlx4 devices and/or IBoE?

Yevgeny, I see you placed in ofed a patch, which isn't upstream, that puts a copy or some modified clone of the table in mlx4_en. What problem were you trying to solve?

Or.
Re: [ewg] [PATCH] pkey fix for ipoib - resubmission
Jason Gunthorpe wrote:
> OFED works on kernels that have compiled-in inline'd multicast map functions that do not include the pkey copy, while mainline's multicast map functions do. So to work around this there is a bit of code in OFED to overwrite the pkey in the multicast hw address. This means on OFED with those kernels ip maddr returns the wrong hw address sometimes..

Okay, got it. Anyway, as this is neither the essence of the patch nor of the discussion here, I would wait to hear what Todd and Mike think about your suggestion to apply the approach taken for the bonding problem.

Or.
Re: sysfs IPoIB root owned writable files
> the following files created under /sys which are world writeable
> /sys/class/net/ib0/delete_child
> /sys/class/net/ib0/create_child
> At least the create_child delete_child files appear to be dangerous to leave as world writeable because they result in resources allocations.

Roland,

If I see a patch in linux-rdma patchwork, e.g. https://patchwork.kernel.org/patch/104502 with the below patch, does this mean it will get reviewed/merged towards 2.6.36, or do you prefer a reminder on the list?

Or.

Yes, this looks bad. The below patch fixes that, I tested it on 2.6.35-rc1

[PATCH] make ipoib child entries non-world writable

Sumeet Lahorani <sumeet.lahor...@oracle.com> reported that the ipoib child entries are world writable, fix them to be root only writable

Signed-off-by: Or Gerlitz <ogerl...@voltaire.com>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index df3eb8c..b4b2257 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1163,7 +1163,7 @@ static ssize_t create_child(struct device *dev,
 	return ret ? ret : count;
 }
-static DEVICE_ATTR(create_child, S_IWUGO, NULL, create_child);
+static DEVICE_ATTR(create_child, S_IWUSR, NULL, create_child);
 static ssize_t delete_child(struct device *dev,
 			    struct device_attribute *attr,
@@ -1183,7 +1183,7 @@ static ssize_t delete_child(struct device *dev,
 	return ret ? ret : count;
 }
-static DEVICE_ATTR(delete_child, S_IWUGO, NULL, delete_child);
+static DEVICE_ATTR(delete_child, S_IWUSR, NULL, delete_child);
 int ipoib_add_pkey_attr(struct net_device *dev)
 {
Re: mlx4 pci device table
Roland Dreier wrote:
> I think the current upstream location is correct. This matches the practice of eg iw_cxgb3 as well as cxgb3i, bnx2i etc. This does have the disadvantage that mlx4_en and mlx4_ib are not auto-loaded by PCI hotplug, but so it goes.

Okay. Still, it's too bad that ofed ships patches that do things the other way around vs upstream. Yevgeny, if you have reasoning for doing things the other way, why not submit it upstream?

Or.
Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
miroslaw.walukiew...@intel.com wrote:
> adds a IB_QPT_RAW_PACKET QP type implementation for nes driver
> +++ b/drivers/infiniband/hw/nes/nes_ud.c
> +static const struct file_operations nes_ud_sksq_fops = {
> +	.owner = THIS_MODULE,
> +	.open = nes_ud_sksq_open,
> +	.release = nes_ud_sksq_close,
> +	.write = nes_ud_sksq_write,
> +	.read = nes_ud_sksq_read,
> +	.mmap = nes_ud_sksq_mmap,
> +};
> +
> +
> +static struct miscdevice nes_ud_sksq_misc = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "nes_ud_sksq",
> +	.fops = &nes_ud_sksq_fops,
> +};

Reading through the May 2010 "RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver" email thread, e.g. at the below links, you say

> The non-bypass post_send/recv channel (using /dev/infiniband/rdma_cm) is shared with all other user-kernel communication and it is quite complex. It is a perfect path for QP/CQ/PD/mem management but for me it is too complex for traffic acceleration. The user-kernel path through additional driver, shared page for lkey/vaddr/len passing and SW memory translation in kernel is much more effective.

http://marc.info/?l=linux-rdma&m=127299659017928
http://marc.info/?l=linux-rdma&m=127306694704653

I still don't see what the performance issue is with the uverbs post_send/post_recv, and if there is one, why it can't be fixed, to avoid introducing a lib/driver nes-specific char device. Could you explain it in some more detail? You mentioned the rdma-cm device file, but the uverbs cmd api is used by libibverbs / uverbs and not by librdmacm / rdma-ucm, which is anyway a slow path.

Also, I understand that the .read (.write) entry maps to posting a receive (send) buffer; what is the use case for the .mmap entry?

> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -1139,7 +1141,6 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
> [...]
> -	atomic_inc(&qps_created);
> @@ -1405,10 +1406,122 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
> [...]
> +	/* moved here to be sure that QP is really created */
> +	/*(now it counted a number of QP creation trials */
> +	atomic_inc(&qps_created);

It's best if this change and a couple more like it are placed in a clean-up patch to nes_verbs.c, such that the amount of RAW QP related changes to review is minimized.

> @@ -2939,6 +3130,9 @@ int nes_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
> 		nesqp->hwqp.qp_id, attr->qp_state, nesqp->ibqp_state,
> 		nesqp->iwarp_state, atomic_read(&nesqp->refcount));
> +	if (ibqp->qp_type == IB_QPT_RAW_PACKET)
> +		return 0;

Isn't a raw qp associated with a specific port of the device?

Or.
Re: root owned writable files under /sys
Sumeet Lahorani wrote:
> # find /sys -type f -perm -222
> /sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
> /sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1

Jack, Tziporet

Can you clarify the status of the upstream kernel mlx4 multi-protocol support? Looking at Linus' git, I see one commit, 7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5 "mlx4_core: Multiple port type support", dated Oct 2008, whereas ofed ships a couple of patches touching this area, e.g. adding the above sysfs entries. So what is the extra functionality introduced, or the bug/s fixed, by those patches? Is there any reason not to push them upstream?

Or.
Re: When IBoE will be merged to upstream?
Liran Liss wrote:
> but keeping ib_create_ah() callable from any context is not a goal by itself.

Going with your approach, if your proposed design is accepted, I believe that you probably need to patch all the code chains that make calls under the current assumption.

> I am looking for constructive ideas for supporting iboe without breaking Verbs/CQE/CM syntax.

I don't agree that exposing the Ethernet L2 related information to the caller is breaking something; on the contrary, it is a required enhancement. I think we need to let the consumer learn, through rdma-cm resolution, the source/destination MACs, vlan id and vlan priority used by an IBoE QP, in exactly the manner in which all the IB equivalents (src/dst lid, pkey, sl) are resolved by the rdma-cm and exposed to the consumer app for an IB QP.

Or.
Re: root owned writable files under /sys
Tziporet Koren wrote:

Jack is on vacation and will be back in 2 weeks. I will ask him to look at this when he is back

All this could have been much simpler if Yevgeny were responding; his sign-off is on the multi-protocol related patches shipped with ofed. So far, I have had a hard time getting responses from him on any of the notes I sent re mlx4_en and _core.

Or.
Re: root owned writeable files under /sys
Roland Dreier rdre...@cisco.com wrote:

thanks, applied

I don't see it, nor any of the other patches you accepted last night, in your for-next branch. Where are they...?

Or.
Re: some dapl assistance
Davis, Arlin R wrote:

There is limited debug in the non-debug builds. If you want full debugging capabilities you can install the source RPM and configure and make as follows [..] (OFED target example):

Okay, got that. Once I built the sources by hand as you suggested I could see debug prints, but things didn't really work, so I stepped back and installed the latest rpms, dapl-2.0.29-1 and compat-dapl-1.2.18-1. Now I can't get intel-mpi to run:

[r...@dodly0 ~]# rpm -qav | grep dapl
dapl-utils-2.0.29-1
dapl-2.0.29-1
compat-dapl-1.2.18-1

[r...@dodly0 ~]# ldconfig -p | grep libdat
libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1

[r...@dodly0 ~]# rpm -qf /usr/lib64/libdat.so.1
compat-dapl-1.2.18-1
[r...@dodly0 ~]# rpm -qf /usr/lib64/libdat2.so.2
dapl-2.0.29-1

[r...@dodly0 ~]# /opt/intel/impi/4.0.0.027/intel64/bin/mpiexec -ppn 1 -n 2 -env DAPL_IB_PKEY 0x8002 -env DAPL_DBG_TYPE 0xff -env DAPL_DBG_DEST 0x3 -env I_MPI_DEBUG 3 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env I_MPI_FABRICS dapl:dapl /tmp/osu
[0] MPI startup(): cannot open dynamic library libdat.so
[1] MPI startup(): cannot open dynamic library libdat.so
[0] MPI startup(): cannot open dynamic library libdat2.so
[0] dapl fabric is not available and fallback fabric is not enabled
[1] MPI startup(): cannot open dynamic library libdat2.so
[1] dapl fabric is not available and fallback fabric is not enabled
rank 1 in job 5 dodly0_54941 caused collective abort of all ranks
exit status of rank 1: return code 254
rank 0 in job 5 dodly0_54941 caused collective abort of all ranks
exit status of rank 0: return code 254

Any idea what we're doing wrong? BTW, before things stopped working, with LD_DEBUG=libs exported to the MPI rank, I noticed that it used the compat-1.2 rpm ...
Now, I can run dapltest fine,

[r...@dodly0 ~]# dapltest -T S -D ofa-v2-mthca0-1
Dapltest: Service Point Ready - ofa-v2-mthca0-1
Dapltest: Service Point Ready - ofa-v2-mthca0-1
Server: Transaction Test Finished for this client

[r...@dodly4 ~]# dapltest -T T -D ofa-v2-mlx4_0-1 -s dodly0 -i 1000 server SR 65536 4 client SR 65536 4
Server Name: dodly0
Server Net Address: 172.30.3.230
DT_cs_Client: Starting Test ...
- Stats : 1 threads, 1 EPs
Total WQE:2919.70 WQE/Sec
Total Time : 0.68 sec
Total Send : 262.14 MB - 382.69 MB/Sec
Total Recv : 262.14 MB - 382.69 MB/Sec
Total RDMA Read : 0.00 MB - 0.00 MB/Sec
Total RDMA Write : 0.00 MB - 0.00 MB/Sec
DT_cs_Client: == End of Work -- Client Exiting

I also noted that dapl-utils and compat-dapl-utils are mutually exclusive, as both attempt to install the same man page for dat.conf:

# rpm -Uvh /usr/src/redhat/RPMS/x86_64/compat-dapl-utils-1.2.18-1.x86_64.rpm
Preparing...### [100%]
file /usr/share/man/man5/dat.conf.5.gz from install of compat-dapl-utils-1.2.18-1.x86_64 conflicts with file from package dapl-utils-2.0.29-1.x86_64

Or.
Re: root owned writable files under /sys
Jack Morgenstein wrote:

The sysfs entries you refer to are introduced in commit 7ff93f8b7ecbc36e7ffc5c11a61643821c1bfee5. Which patches in ofed but not upstream are you referring to?

Hi Jack,

I took another look. Indeed, the mlx4_port{1,2} sysfs entries are introduced in the commit you pointed to, and their permissions look okay (S_IRUGO | S_IWUSR); they are not world writable. The port_trigger sysfs entry, however, is introduced by a patch shipped with ofed which isn't upstream (mlx4_1190_sense_port_trigger.patch), and this entry is indeed world writable. So the question is whether there is any reason for multi-protocol related patches such as this one not to be pushed upstream. I failed to get any constructive response (== patches to Roland or Dave Miller) from Yevgeny, and I was hoping you could be helpful here.

Or.

Sumeet Lahorani wrote:

# find /sys -type f -perm -222
/sys/devices/pci:00/:00:04.0/:13:00.0/port_trigger
/sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port2
/sys/devices/pci:00/:00:04.0/:13:00.0/mlx4_port1
Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
I don't think there are applications around which would use raw qp AND are linked against libibverbs-1.0, such that they would exercise the 1_0 wrapper, so we can ignore the 1st allocation, the one at the wrapper code. As for the 2nd allocation, since a WQE --posting-- is synchronous, using the maximal values specified during the creation of the QP, I believe that this allocation can be done once per QP and used later. [...]

Hi Mirek, any comment on my response to the NES patch you sent?

Or.

dive to kernel: ib_uverbs_post_send()

user_wr = kmalloc(cmd.wqe_size, GFP_KERNEL); - 3. dyn alloc
next = kmalloc(ALIGN(sizeof *next, sizeof (struct ib_sge)) + user_wr->num_sge * sizeof (struct ib_sge), GFP_KERNEL); - 4. dyn alloc

And now there is final call to driver.

Much the same here: for #4 you can compute and allocate, once per QP, the maximal possible size for next and use it later. As for #3, this needs further thinking. But before diving into all these design changes, what was the penalty introduced by these allocations? Is it in packets-per-second, in latency?

Diving to kernel is treated as a something like passing signal to kernel that there is prepared information to post_send/post_recv. The information about buffers are passed through shared page (available to userspace through mmap) to avoid copying of data. Write() ops is used to passing signal about post_send. Read() ops is used to pass information about post_recv(). We avoid additional copying of the data that way.

Thanks for the heads-up. I took a look, and this user/kernel shared memory page is used to hold the work request; it has nothing to do with data. As for the work request, you still have to copy it in user space from the user work request to the library's mmaped buffer. So the only difference would be the copy_from_user done by uverbs, for a few tens of bytes. Can you tell if there is an extra penalty introduced by this copy, and what it is?
struct nes_ud_send_wr {
        u32 wr_cnt;
        u32 qpn;
        u32 flags;
        u32 resv[1];
        struct ib_sge sg_list[64];
};

struct nes_ud_recv_wr {
        u32 wr_cnt;
        u32 qpn;
        u32 resv[2];
        struct ib_sge sg_list[64];
};

Looking at struct nes_ud_send/recv_wr, I'm not sure I follow: the same instance can be used to post a list of work requests, where each work request is limited to one SGE, am I correct? I don't think there is a need to support posting 64 --send-- requests. For recv it might make sense, but it could be done in a batch/background flow. Thoughts?

Or.
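The "allocate once per QP" suggestion made earlier in this message can be sketched in user-space C as below. This is only an illustration under stated assumptions: every name ending in _sketch is hypothetical (a stand-in for ib_sge, ib_send_wr and the QP), malloc() stands in for kmalloc(), and a single scratch buffer is only valid if posts on the QP are serialized and handle one work request at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's ib_sge and ib_send_wr. */
struct sge_sketch { uint64_t addr; uint32_t length; uint32_t lkey; };
struct wr_sketch  { struct wr_sketch *next; struct sge_sketch *sg_list; int num_sge; };

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

struct qp_sketch {
        int    max_send_sge;    /* fixed when the QP is created */
        void  *wr_scratch;      /* sized once for the worst case */
        size_t wr_scratch_len;
};

/* Same size formula as the quoted kmalloc() call, for a given SGE count. */
static size_t wr_bytes(int num_sge)
{
        return ALIGN_UP(sizeof(struct wr_sketch), sizeof(struct sge_sketch)) +
               (size_t)num_sge * sizeof(struct sge_sketch);
}

/* QP creation: the only place where memory is allocated. */
int qp_sketch_init(struct qp_sketch *qp, int max_send_sge)
{
        assert(max_send_sge > 0);
        qp->max_send_sge = max_send_sge;
        qp->wr_scratch_len = wr_bytes(max_send_sge);
        qp->wr_scratch = malloc(qp->wr_scratch_len);
        return qp->wr_scratch ? 0 : -1;
}

/* Post path: no allocation, the WR is carved out of the scratch area. */
struct wr_sketch *qp_sketch_get_wr(struct qp_sketch *qp, int num_sge)
{
        struct wr_sketch *wr;

        if (num_sge < 0 || num_sge > qp->max_send_sge)
                return NULL;    /* exceeds what was negotiated at creation */
        wr = qp->wr_scratch;
        wr->next = NULL;
        wr->num_sge = num_sge;
        wr->sg_list = (struct sge_sketch *)((char *)wr +
                        ALIGN_UP(sizeof(*wr), sizeof(struct sge_sketch)));
        return wr;
}

void qp_sketch_destroy(struct qp_sketch *qp)
{
        free(qp->wr_scratch);
        qp->wr_scratch = NULL;
}
```

A caller would run qp_sketch_init() at QP creation, qp_sketch_get_wr() on every post, and qp_sketch_destroy() at teardown. Whether this actually helps depends on the measured cost of the allocations, which is the open question above.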
Re: [PATCH] RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver
Walukiewicz, Miroslaw wrote:

I agree with you that it is possible to fix the post_send path in OFED. Let me think a few days yet.

Hi Mirek, okay. Just one comment: the way I see it, ofed is very much not something that has a post_send path. It's a temporary, ad hoc collection of bits which pretends to be a distribution of the Linux IB stack, very far from being well organized, and actually much worse than you may think (try the archives for "shovel in unreviewed junk" or "pile of shit"). The credit or discredit, and any questions, patches, bugs or flames for this or that element of the IB stack, should all go to its maintainer/s, specifically those of libibverbs, ib_uverbs etc. (who happen to be CC-ed here...).

Or.
sense remote hardware address change by rdma-cm applications
Today, the kernel neighbour maintenance state-machine / engine doesn't come into play for neighbours created on behalf of rdma-cm consumers. This is because the send path is offloaded away from the network stack to the app QP, and as such the neighbour created following the ARP request / reply initiated by rdma_resolve_addr() quickly gets aged and deleted. Am I correct in that?

This behaviour means that rdma-cm RC apps sense a remote hardware address change based only on the RC QP timeout, whereas UD apps have no way other than implementing some sort of keep-alive / probing mechanism to make sure their AH is valid. So how about:

A. ref a neighbour created on behalf of or used by an rdma-cm ID (*)
B. enhance the rdma-cm address_change event to report on remote hardware address change, based on neighbour events

Or.

(*) would a per-ID neigh_hold() call (paired with neigh_release() when the ID gets destroyed) work to that end?
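The pairing asked about in (*) can be sketched as a user-space mock. It is not kernel code: the _sketch types and functions are hypothetical stand-ins that only imitate the reference counting of neigh_hold() / neigh_release(), and the sketch says nothing about whether holding the reference keeps the neighbour probes running, which is the actual open question.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a neighbour entry: just a reference count. */
struct neigh_sketch { int refcnt; };

/* A new entry starts with one reference, owned by the neighbour table. */
struct neigh_sketch *neigh_sketch_alloc(void)
{
        struct neigh_sketch *n = malloc(sizeof(*n));

        if (n)
                n->refcnt = 1;
        return n;
}

void neigh_sketch_hold(struct neigh_sketch *n)
{
        assert(n->refcnt > 0);
        n->refcnt++;
}

/* Returns 1 when the last reference was dropped and the entry was freed. */
int neigh_sketch_release(struct neigh_sketch *n)
{
        assert(n->refcnt > 0);
        if (--n->refcnt == 0) {
                free(n);
                return 1;
        }
        return 0;
}

/* Hypothetical stand-in for an rdma-cm ID that pins its neighbour. */
struct cm_id_sketch { struct neigh_sketch *neigh; };

/* Address resolution done: take a per-ID reference. */
void cm_id_sketch_bind_neigh(struct cm_id_sketch *id, struct neigh_sketch *n)
{
        neigh_sketch_hold(n);
        id->neigh = n;
}

/* ID destruction: drop the per-ID reference taken above. */
int cm_id_sketch_destroy(struct cm_id_sketch *id)
{
        int freed = 0;

        if (id->neigh) {
                freed = neigh_sketch_release(id->neigh);
                id->neigh = NULL;
        }
        return freed;
}
```

In this mock, the table dropping its own reference (the ageing case) no longer frees the entry while an ID still holds it; the entry goes away only when the ID is destroyed.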
Re: NULL pointer dereference in rdma_ucm
Josh England wrote:

It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port , causing the Oops, but I don't know for sure. Any ideas on how to debug this?

It seems this was reported in the past but remained unsolved: http://lists.openfabrics.org/pipermail/general/2009-August/thread.html#61522

Or.
Re: sense remote hardware address change by rdma-cm applications
Jason Gunthorpe wrote:

It is a bit wider problem than just ND entries, changes in routing can also alter the L2 address, so that needs to be tracked as well.

Sure. When we did the address change work, see commit dd5bdff "RDMA/cma: Add RDMA_CM_EVENT_ADDR_CHANGE event", the problem I wanted to solve was related to local bonding. During the review thread, remote address changes related to bonding fail-over and routing changes were mentioned, and left to future work.

this is back to original criticisms from netdev of this whole separated stack idea - it isn't integrated, so where do you draw the line? What gets left out? Today, it is pretty clear that only the CM portion integrates at all with netdev and after that things are separate.

The address change event was an attempt to make the CM part, which integrates with netdev, go a step further and help the offloaded data path be more consistent with netdev. This email is about going another step.

So.. I think to tackle this you need to start looking at how the dst_entry structure works in netdev and apply the same idea to RDMA-CM and reflect the changes in AH back to the QP owner.

I can take a look (a pointer would be very much appreciated...). Still, the dst entry is used for every netdev xmit, whereas here the xmit is offloaded, so I don't see what could really be used from the dst code, but I might be wrong. The rdma app uses the neighbour once, upon address resolution, and I was trying to see if we can ref the neighbour so that the neigh sub-system probes keep going even though the neighbour is not directly used.

Is this an iwarp problem too? Not sure how L3-L2 translation works there.

I never managed to understand how address resolving really works with iwarp... Doing a bit of detective work...
you can see that addr4_resolve says

        /* If the device does ARP internally, return 'done' */
        if (rt->idev->dev->flags & IFF_NOARP) {
                rdma_copy_addr(addr, rt->idev->dev, NULL);
                goto put;
        }

and later cma_connect_iw places the src/dst IP addresses into the iwarp cm

        sin = (struct sockaddr_in*) &id_priv->id.route.addr.src_addr;
        cm_id->local_addr = *sin;
        sin = (struct sockaddr_in*) &id_priv->id.route.addr.dst_addr;
        cm_id->remote_addr = *sin;

so all the iwarp providers do ARP resolving in their TOE stack?! Steve, can you clarify that?

Not sure what you do about UD.. Maybe RDMA-CM learns to do UC where the only action is to register notification monitors for L2 addressing changes in the kernel?

The problem exists for all IB transports (even for RD, had it been implemented...). The only difference between the U and R ones is that for the R's, if the remote side vanished, the IB HW would eventually let you know about it in the form of a CQ error.

Can this be hidden with Sean's recent work on simplified programming models?

Not sure how Sean's work relates to this proposed change.

Or.
[PATCH for-2.6.36] ib: fix some sparse warnings
Fixed the following issues pointed out by sparse under drivers/infiniband:

CHECK drivers/infiniband/hw/cxgb3/iwch_cm.c
iwch_cm.c:140:5: warning: symbol 'iwch_l2t_send' was not declared. Should it be static?

CHECK drivers/infiniband/hw/nes/nes_verbs.c
nes_verbs.c:1944:45: warning: Using plain integer as NULL pointer
nes_verbs.c:1944:48: warning: Using plain integer as NULL pointer

CHECK drivers/infiniband/hw/nes/nes_cm.c
nes_cm.c:2645:43: warning: mixing different enum types
nes_cm.c:2645:43: int enum iw_cm_event_type versus
nes_cm.c:2645:43: int enum iw_cm_event_status

CHECK drivers/infiniband/ulp/iser/iser_initiator.c
iser_initiator.c:173:5: warning: symbol 'iser_alloc_rx_descriptors' was not declared. Should it be static?

Signed-off-by: Or Gerlitz ogerl...@voltaire.com

I didn't address these two:

CHECK drivers/infiniband/hw/cxgb3/iwch_cq.c
drivers/infiniband/hw/cxgb3/iwch_cq.c:192:9: warning: context imbalance in 'iwch_poll_cq_one' - different lock contexts for basic block

CHECK drivers/infiniband/hw/cxgb3/iwch_qp.c
drivers/infiniband/hw/cxgb3/iwch_qp.c:805:13: warning: context imbalance in '__flush_qp' - unexpected unlock

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index ebfb117..3cdb535 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -137,7 +137,7 @@ static void stop_ep_timer(struct iwch_ep *ep)
 	put_ep(&ep->com);
 }
 
-int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
+static int iwch_l2t_send(struct t3cdev *tdev, struct sk_buff *skb, struct l2t_entry *l2e)
 {
 	int error = 0;
 	struct cxio_rdev *rdev;
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 986d6f3..98887af 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -2565,7 +2565,7 @@ static int nes_cm_disconn_true(struct nes_qp *nesqp)
 	u16 last_ae;
 	u8 original_hw_tcp_state;
 	u8 original_ibqp_state;
-	enum iw_cm_event_type disconn_status = IW_CM_EVENT_STATUS_OK;
+	enum iw_cm_event_status disconn_status = IW_CM_EVENT_STATUS_OK;
 	int issue_disconn = 0;
 	int issue_close = 0;
 	int issue_flush = 0;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 9bc2d74..0df51a4 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1941,7 +1941,7 @@ static int nes_reg_mr(struct nes_device *nesdev, struct nes_pd *nespd,
 	u8 use_256_pbls = 0;
 	u8 use_4k_pbls = 0;
 	u16 use_two_level = (pbl_count_4k > 1) ? 1 : 0;
-	struct nes_root_vpbl new_root = {0, 0, 0};
+	struct nes_root_vpbl new_root = {0, NULL, NULL};
 	u32 opcode = 0;
 	u16 major_code;
 
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 0b9ef07..95a08a8 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -170,7 +170,7 @@ static void iser_create_send_desc(struct iser_conn *ib_conn,
 }
 
 
-int iser_alloc_rx_descriptors(struct iser_conn *ib_conn)
+static int iser_alloc_rx_descriptors(struct iser_conn *ib_conn)
 {
 	int i, j;
 	u64 dma_addr;
Re: knockdown voltaire switch with ARP multicast
Bob Ciotti wrote:

Maybe someone on the voltaire side can help. I'm working the issue now Wed Jul 21 00:34:14 PDT 2010

Hi Bob, I understand that some folks from Voltaire are working with you directly.

Or.
Re: NULL pointer dereference in rdma_ucm
Josh England wrote:

Do you think upgrading to OFED-1.5.1 would help at all?

It might help you diagnose the problem better. If you read through the thread I pointed to (it's very short, four messages, less than two minutes), you would see that Arthur is reporting on the lap_state and Sean is suggesting using the IB CM sysfs counters to debug this further. I don't know if these counters exist in the IB stack of the ofed drop you're using, but they should be in 1.5.x.

Or.
Re: sense remote hardware address change by rdma-cm applications
Steve Wise wrote:

The cxgb3/4 drivers do not set IFF_NOARP and rely on ND being done as part of connection setup. The driver will initiate ND if there isn't a neigh entry available at the time the iwarp driver tries to send a SYN or SYN/ACK.

Okay, understood, thanks for clarifying this.

The cxgb* drivers actually reference the neigh and dst structs until the offload connection is gone. Also if the offloaded connection has problems transmitting (due to a L2 address change, for example), then the driver will initiate ND again by calling neigh_event_send(). See t4_l2t_send_event() in l2t.c which is called by the iwarp driver in peer_abort() from iwch_cm.c when the HW tells us its retransmitting too much.

In the general case of an rdma-cm consumer, e.g. IB RC based and/or UD unicast based, we don't have such a feedback mechanism from the HW. As such, I would draw the line here at adopting into the rdma-cm the behavior of referencing the neigh and dst structures until the connection is gone (could you point to the func/path in drivers/net/cxgb3/l2t.c which does this? I wasn't sure).

What doesn't happen is active positive feedback during the connection to avoid NUD. IE once the connection is setup, nobody calls dst_confirm() It is only called during connection setup/teardown.

I think we can live with that. This is similar to the case of an app using UDP in a uni-directional manner between host A --> B, so the NUD part of the network stack @ host A has to issue timely probes to validate the L2 address of host B. The only difference is that we have the A --> B comm offloaded, and eventually, without keeping the ref, the neighbour and dst are deleted; the proposed patch eliminates this deletion.

Or.
Re: sense remote hardware address change by rdma-cm applications
Jason Gunthorpe wrote:

I'm thinking something like this..
- The RDMA CM gets the dst from its route lookup locks it and stores it.
- Instead of doing a route lookup cxgb gets the dst from RDMA CM, locks it and stores it
- RDMA CM traps all notifications/etc and generates callback to cxgb to say the dst has changed.
- cxgb releases the old dst and grabs the new one, updates the HW, etc.

Jason, I'm up for extending the rdma-cm address change event, on which an app can decide whether to react or not. For example, the in-tree iser and rds code treat this event the same as an arriving disconnection request, which means the higher layer (e.g. the user space iscsi daemon in the iser case) would try to re-connect. This has the advantage of simplifying the ULP state-machine: there's no need for special handling of address-change, just treat it as a hint that re-connection is needed. The cxgb* code takes this deeper, as it handles L2 changes at the driver level rather than as an event delivered to the ULP, which the ULP can optionally address or ignore.

Or.
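The "treat address change like a disconnect" approach described above can be illustrated with a tiny state machine. This is a sketch only, not the iser or rds code: the enums and ulp_sketch_handle() are hypothetical names, and the real ULPs react to RDMA_CM_EVENT_* values delivered by the rdma-cm.

```c
#include <assert.h>

/* Hypothetical event and state names, standing in for the rdma-cm
 * events and for a ULP's own connection states. */
enum cm_event_sketch { EV_ESTABLISHED, EV_DISCONNECTED, EV_ADDR_CHANGE, EV_OTHER };
enum conn_state_sketch { ST_CONNECTING, ST_UP, ST_TERMINATING };

/* Returns the next connection state for a given event. */
enum conn_state_sketch ulp_sketch_handle(enum conn_state_sketch st,
                                         enum cm_event_sketch ev)
{
        assert(st >= ST_CONNECTING && st <= ST_TERMINATING);

        switch (ev) {
        case EV_ESTABLISHED:
                return st == ST_CONNECTING ? ST_UP : st;
        case EV_DISCONNECTED:
        case EV_ADDR_CHANGE:
                /* One shared teardown path: an address change is only a
                 * hint that the upper layer should re-connect. */
                return ST_TERMINATING;
        default:
                return st;
        }
}
```

The design point is that no extra state is needed for the address-change case; the upper layer's existing re-connect logic does the rest.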
Re: Issue with RDMA_CM on systems with multiple IB HCA's.
Hari Subramoni wrote:

[subra...@amd6 perftest]$ ./ib_rdma_bw -c 172.16.1.5
11928: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=1 |
11928: Local address: LID , QPN 00, PSN 0x5bfbba RKey 0x90042602 VAddr 0x002b27feabe000
11928: Remote address: LID , QPN 00, PSN 0x392fe6, RKey 0xf8042605 VAddr 0x002b9d5c93b000

You can see that the lid and qp numbers are zero, so something is broken... When you use the rdma-cm, the address provided to the utility should be on an IPoIB subnet. Is that what you're doing? Basically, I would suggest that you first use rping(1), provided by librdmacm-utils, to make sure things are working well in your configuration, and then move to the perftest utils.

Or.
Re: Issue with RDMA_CM on systems with multiple IB HCA's.
Hari Subramoni subra...@cse.ohio-state.edu wrote:

The nodes have LID's assigned to them and OpenSM is running fine. I've attached the configurations of the two hosts along with this e-mail. As Jonathan mentioned, we are able to ping between them.

Are the two HCAs on each of the nodes connected to the same IB subnet?

The issue is intermittent. It happens at times and at other times, things work fine. Please let us know if you need any more information.

Let's focus on rping. Please use both the -v and -d flags with rping. Also, when rping fails, please send the neighbours info (# ip neigh show) from host .5

Or.
Re: CMA handler status code
Eldad Zinger wrote:

event.status = ib_event->param.sidr_rep_rcvd.status
event.status = ib_event->param.rej_rcvd.reason

event.status should be 0 for success, or negative value of generic error code. In that code, the error code is positive and do not comply with generic error code.

Basically, I believe that "status equals the reject reason for the rdma-cm reject event" is known to the kernel developers that deal with the rdma-cm. Personally, I'm fine with it. We could document that, but currently there's no rdma-cm document under Documentation/infiniband which could hold this. For user space, I would add a comment in the man pages.

In order to make the status field available for other modules (like SDP), that field should be format-consistent.

With SDP being out of tree for about four to six years (and counting), it is somewhat hard to take into account claims related to it.

Or.
Re: CMA handler status code
For user space, I would add a comment in the man pages

[PATCH] librdmacm/man: document status field semantics for rejected event

document status being the IB reject reason for RDMA_CM_EVENT_REJECTED event

Signed-off-by: Or Gerlitz ogerl...@voltaire.com

diff --git a/man/rdma_get_cm_event.3 b/man/rdma_get_cm_event.3
index 79bf606..91317c4 100644
--- a/man/rdma_get_cm_event.3
+++ b/man/rdma_get_cm_event.3
@@ -126,7 +126,8 @@ Generated on the active side to notify the user that the remote server is
 not reachable or unable to respond to a connection request.
 .IP RDMA_CM_EVENT_REJECTED
 Indicates that a connection request or response was rejected by the remote
-end point.
+end point. Under Infiniband, the event status field contains the reject reason
+as provided by the IB CM.
 .IP RDMA_CM_EVENT_ESTABLISHED
 Indicates that a connection has been established with the remote end point.
 .IP RDMA_CM_EVENT_DISCONNECTED
[PATCH] ib/mlx4: add IB_CQ_REPORT_MISSED_EVENTS support
enhance the cq arming code to support IB_CQ_REPORT_MISSED_EVENTS

Signed-off-by: Or Gerlitz ogerl...@voltaire.com

I noted that the IB_CQ_REPORT_MISSED_EVENTS flag was added in the same cycle as mlx4, and maybe because of this mlx4 didn't implement the flag, which is used by IPoIB. The patch is compile tested only; if it seems okay, I can conduct further testing.

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 5a219a2..4366811 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -755,6 +755,13 @@ int mlx4_ib_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 		    to_mdev(ibcq->device)->uar_map,
 		    MLX4_GET_DOORBELL_LOCK(&to_mdev(ibcq->device)->uar_lock));
 
+	if (flags & IB_CQ_REPORT_MISSED_EVENTS) {
+		struct mlx4_cqe *cqe;
+		cqe = next_cqe_sw(to_mcq(ibcq));
+		if (cqe)
+			return 1;
+	}
+
 	return 0;
 }
 
Re: [PATCH] ib/mlx4: add IB_CQ_REPORT_MISSED_EVENTS support
Eli Cohen wrote:

returning 1 means that you must poll the CQ to avoid a race condition which is not true for mlx4.

Makes sense, thanks for clarifying that.

Or.
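For readers following the thread, the consumer-side contract behind the "returning 1" remark can be sketched with a user-space mock. The assumptions are loud here: struct cq_sketch and its functions are hypothetical, the late_arrivals field exists only to simulate a completion landing between the last poll and the arm call, and this is not mlx4 or IPoIB code.

```c
#include <assert.h>

/* Hypothetical mock of a completion queue. */
struct cq_sketch {
        int pending;        /* completions currently queued */
        int late_arrivals;  /* completions landing just before the arm call */
        int armed;          /* set once notification was requested */
};

/* Remove up to 'budget' completions; returns how many were taken. */
int cq_sketch_poll(struct cq_sketch *cq, int budget)
{
        int n;

        assert(budget > 0);
        n = cq->pending < budget ? cq->pending : budget;
        cq->pending -= n;
        return n;
}

/* Request notification. With report_missed set, return 1 if completions
 * are already queued, meaning the caller has to poll again. */
int cq_sketch_arm(struct cq_sketch *cq, int report_missed)
{
        cq->pending += cq->late_arrivals;   /* simulate the race window */
        cq->late_arrivals = 0;
        cq->armed = 1;
        return (report_missed && cq->pending > 0) ? 1 : 0;
}

/* Drain, arm, and keep draining for as long as the arm call reports
 * that something may have been missed. Returns completions handled. */
int cq_sketch_drain_and_arm(struct cq_sketch *cq, int budget)
{
        int total = 0, n;

        do {
                while ((n = cq_sketch_poll(cq, budget)) > 0)
                        total += n;
        } while (cq_sketch_arm(cq, 1) > 0);

        return total;
}
```

A provider for which the race cannot happen, as Eli states for mlx4, can simply return 0 from the arm call, and a loop of this shape then exits after a single pass.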
Re: CMA handler status code
Hefty, Sean wrote:

The original intent was to expose the transport specific status values to the user, rather than trying to map them.

Yes, this makes sense. Are you okay with documenting that, e.g. in the spirit of the patch I sent?

Or.
Re: InfiniBand/RDMA merge plans for 2.6.36
Walukiewicz, Miroslaw wrote:

Hello Roland, What about a series from Aleksey Senin [...] And my patch RDMA/nes: IB_QPT_RAW_PACKET QP type support for nes driver https://patchwork.kernel.org/patch/110252

Hi Mirek,

Reading your response @ http://marc.info/?l=linux-rdma&m=127954552519544 to the comments made during the review, I was under the impression that you were going to try to modify the NES implementation. Isn't this the case any more?

Or.
Re: FW: [PATCH v2] rdma/ib_pack.h: add new bth opcodes
Robert Pearson wrote:

Several new opcodes have been added since the last time ib_pack.h was updated. These changes add them.

+++ b/include/rdma/ib_pack.h
+ IB_OPCODE_CN= 0x80,
+ IB_OPCODE_XRC = 0xA0,

Is this tied to some existing or new IBA 1.2 annex? A pointer would be appreciated.

Or.
Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes
Bob Pearson wrote:

My interest is supporting the rxe driver, a software implementation of the IB transport over Ethernet, [...] I spent a little time looking at trying to exploit congestion notification to see if it would be useful in this context.

Hi Bob,

As IB congestion control / notification relies on the IB switches marking packets with FECNs, I don't see how IB CCA fits into the IBoE scheme. Paul?

Or.
Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes
Bob Pearson wrote:

I was wondering if I could use this to cause ConnectX RDMAoE senders to slow down in response to these packets. There is a challenge managing fast ROCE senders in networks that may not fully implement per priority pause.

Hi Bob,

QCN (the IEEE 802.1 based Ethernet congestion control mechanism) can apply to IBoE traffic, in the same manner it would to FCoE, IP etc. Is there a specific reason you wanted to apply the IB mechanism and not use the Ethernet one? Yep, PFC is helpful.

Or.
Re: [PATCH] RDMA/nes: double CLOSE event indication crash
Faisal Latif wrote:

During a stress testing in a large cluster, multiple close event is detected and BUG() is hit in core. The cause is [...]

Do you mean the core of the IB stack? If not, whose core?

Or.
Re: using same IP subnet on multiple interfaces (was: dual HCAs with upstream kernel)
Hefty, Sean wrote:

Does anyone have a system with multiple HCAs that's running a recent upstream kernel? Oracle has reported a bug connecting between two HCAs in the same system using the rdma_cm

Sean,

With 2.6.35, I was hitting the reported failure (address error event, status -ETIMEDOUT) with a simpler configuration of two ports belonging to the same HCA. I used ucmatose and not rping, as the former allows specifying a local binding whereas the latter doesn't (see below).

Next, I realized that a similar test with ping(8) doesn't work either: the arp request was xmitted through one interface (ib0) and received on the other (ib1), but no reply was generated. At this point, I thought that maybe one of the arp-related sysctls could affect that, and I got an initial hit... Following commit 8153a10, once I set net.ipv4.conf.ib1.accept_local to 1, I could # ping -I ib0 to ib1's address, where before that I couldn't. ucmatose started working as well, no problem.

commit 8153a10c08f1312af563bb92532002e46d3f504a
Author: Patrick McHardy ka...@trash.net
Date: Thu Dec 3 01:25:58 2009 +

[...] Change fib_validate_source() to accept packets with a local source address when the accept_local sysctl is set for the incoming inet device. Combined with the previous patches, this allows to communicate between multiple local interfaces over the wire.

# ip r s
192.168.20.0/24 dev ib0 proto kernel scope link src 192.168.20.1
192.168.20.0/24 dev ib1 proto kernel scope link src 192.168.20.100

Before net.ipv4.conf.ib1.accept_local was set to 1, ping wasn't working:

# ping -I ib0 192.168.20.100 -q
# PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.
# tcpdump -ni ib0 10:12:14.679101 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 10:12:15.679337 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 # tcpdump -ni ib1 10:13:35.798332 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 10:13:36.798569 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 # ip n s 192.168.20.100 dev ib0 INCOMPLETE after net.ipv4.conf.ib1.accept_local to 1, ping (and ucmatose) work, but # tcpdump -ni ib0 10:29:32.196866 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 10:29:32.197047 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56 10:29:32.197058 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64 10:29:32.197125 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64 10:29:33.197013 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64 # tcpdump -ni ib1 10:29:32.196920 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56 10:29:32.196944 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56 10:29:32.197029 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56 10:29:32.197136 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64 10:29:33.197023 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64 10:29:34.197357 IP 192.168.20.1 192.168.20.100: ICMP echo request, id 33038, seq 3, length 64 the echo requests go on the wire, the replies not, probably (...) internally, Patrick? I noted that the neighbour on the NIC which is replying quickly gets stale and later aged out # ip n s 192.168.20.100 dev ib0 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8 REACHABLE 192.168.20.1 dev ib1 lladdr 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e7 STALE Or. 
This is my related configuration, I tried changing rp_filter to 0 but it didn't change things either # sysctl -a | grep accept_local | grep ib[0,1] net.ipv4.conf.ib0.accept_local = 1 net.ipv4.conf.ib1.accept_local = 1 # sysctl -a | grep rp_ | grep ib[0,1] net.ipv4.conf.ib0.rp_filter = 1 net.ipv4.conf.ib0.arp_filter = 0 net.ipv4.conf.ib0.arp_announce = 0 net.ipv4.conf.ib0.arp_ignore = 1 net.ipv4.conf.ib0.arp_accept = 0 net.ipv4.conf.ib0.arp_notify = 0 net.ipv4.conf.ib0.proxy_arp_pvlan = 0 net.ipv4.conf.ib1.rp_filter = 1 net.ipv4.conf.ib1.arp_filter = 0 net.ipv4.conf.ib1.arp_announce = 0 net.ipv4.conf.ib1.arp_ignore = 1 net.ipv4.conf.ib1.arp_accept = 0 net.ipv4.conf.ib1.arp_notify = 0 net.ipv4.conf.ib1.proxy_arp_pvlan = 0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
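For anyone who wants to check these knobs from a test program instead of sysctl(8): each per-interface key shown above corresponds to a file under /proc/sys/net/ipv4/conf/<ifname>/. Below is a minimal sketch of reading one of them. The helper names ipv4_conf_path() and ipv4_conf_read() are made up for this example, and it assumes the interface name contains no dots.

```c
#include <stdio.h>

/*
 * Build the /proc/sys path of a per-interface IPv4 sysctl, e.g.
 * net.ipv4.conf.ib1.accept_local. Hypothetical helper, not an
 * existing API. Returns 0 on success, -1 if buf is too small.
 */
int ipv4_conf_path(char *buf, size_t len, const char *ifname, const char *key)
{
	int n = snprintf(buf, len, "/proc/sys/net/ipv4/conf/%s/%s", ifname, key);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the integer value of a per-interface IPv4 sysctl into *val.
 * Returns 0 on success, -1 if the file cannot be opened or parsed
 * (for instance when the interface does not exist).
 */
int ipv4_conf_read(const char *ifname, const char *key, int *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (ipv4_conf_path(path, sizeof(path), ifname, key) < 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = (fscanf(f, "%d", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling ipv4_conf_read("ib1", "accept_local", &v) before a loopback test makes it easy to log the state the test actually ran with.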
Re: [PATCH] RDMA/nes: double CLOSE event indication crash
Latif, Faisal wrote:
> BUG() was in iw_cm.ko in its close handler mentioned as core in my email and caused by iw_nes.ko.

I see. Looks like iwcm.c accounts for most of the BUG* calls made from the core; it would be nice to reduce them over time.

Or.

# grep -n BUG drivers/infiniband/core/*.c | grep "("
drivers/infiniband/core/cma.c:1262: BUG_ON(1);
drivers/infiniband/core/cm.c:1169: BUG_ON(cm_id->state != IB_CM_IDLE);
drivers/infiniband/core/cm.c:1318: BUG_ON(!work);
drivers/infiniband/core/device.c:175: BUG_ON(size < sizeof (struct ib_device));
drivers/infiniband/core/device.c:194: BUG_ON(device->reg_state != IB_DEV_UNREGISTERED);
drivers/infiniband/core/iwcm.c:120: BUG_ON(!list_empty(&cm_id_priv->work_free_list));
drivers/infiniband/core/iwcm.c:163: BUG_ON(atomic_read(&cm_id_priv->refcount)==0);
drivers/infiniband/core/iwcm.c:165: BUG_ON(!list_empty(&cm_id_priv->work_list));
drivers/infiniband/core/iwcm.c:186: BUG_ON(!list_empty(&cm_id_priv->work_list));
drivers/infiniband/core/iwcm.c:241: BUG_ON(qp == NULL);
drivers/infiniband/core/iwcm.c:298: BUG();
drivers/infiniband/core/iwcm.c:374: BUG();
drivers/infiniband/core/iwcm.c:397: BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, &cm_id_priv->flags));
drivers/infiniband/core/iwcm.c:518: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV);
drivers/infiniband/core/iwcm.c:583: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT);
drivers/infiniband/core/iwcm.c:620: BUG_ON(iw_event->status);
drivers/infiniband/core/iwcm.c:695: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_RECV);
drivers/infiniband/core/iwcm.c:723: BUG_ON(cm_id_priv->state != IW_CM_STATE_CONN_SENT);
drivers/infiniband/core/iwcm.c:795: BUG();
drivers/infiniband/core/iwcm.c:824: BUG();
drivers/infiniband/core/iwcm.c:865: BUG_ON(atomic_read(&cm_id_priv->refcount)==0);
drivers/infiniband/core/iwcm.c:869: BUG_ON(!list_empty(&cm_id_priv->work_list));
drivers/infiniband/core/mad.c:587: BUG_ON(!mad_list->mad_queue);
drivers/infiniband/core/mad.c:1396: BUG_ON(!*method);
drivers/infiniband/core/mad.c:1406: BUG_ON(*method);
drivers/infiniband/core/mad.c:2242: BUG_ON(1);
drivers/infiniband/core/verbs.c:91: BUG();
Re: using same IP subnet on multiple interfaces
Jason Gunthorpe wrote:
> [...] The socket that is bound to a device will then use its device for sending, but other sockets not bound to devices will do route lookups and use the lo device. Do: [...] To see the difference in each side.

Sure, makes sense: the ping-reply code does a route lookup and will use the loopback device.

I took a second look at ping w.r.t. the various sysctl states. When rp_filter is set to its default

# sysctl -a | grep -wE "accept_local|rp_filter|arp_ignore" | grep ib
net.ipv4.conf.ib0.rp_filter = 1
net.ipv4.conf.ib0.accept_local = 1
net.ipv4.conf.ib0.arp_ignore = 1
net.ipv4.conf.ib1.rp_filter = 1
net.ipv4.conf.ib1.accept_local = 1
net.ipv4.conf.ib1.arp_ignore = 1

ping isn't working, since there's no ARP reply:

# ping -I ib0 192.168.20.100
PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.
From 192.168.20.1 icmp_seq=2 Destination Host Unreachable
From 192.168.20.1 icmp_seq=3 Destination Host Unreachable
From 192.168.20.1 icmp_seq=4 Destination Host Unreachable

# tcpdump -ni ib0
18:04:39.492306 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
18:04:40.492541 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

# tcpdump -ni ib1
18:04:42.497039 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
18:04:43.497268 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

Once I set net.ipv4.conf.ib1.rp_filter=0, ARP replies are generated and ping works as you explained: echo-request externally, echo-reply internally.

# tcpdump -ni ib1
18:06:33.103248 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
18:06:33.103281 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
18:06:33.103369 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
18:06:33.103461 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 1, length 64
18:06:34.107465 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 2, length 64

Now, if I return rp_filter to 1, ping keeps working using the neighbour previously created. ping even keeps working when I set net.ipv4.conf.ib1.accept_local to 0, which is a bit weird unless this sysctl is meant to act at the neighbour level (i.e. control ARP replies and not every packet transmit).

> To really effect a full external loopback you need to have both sides bound to their respective devices. Note that binding to a device and binding to a source IP are not the same thing in Linux.

Even without being fully into the details of what binding to a source IP actually translates to, I understand there's a difference.

> In the RDMA CM case the listening side doesn't do any IP routing operations at all so a device bind isn't necessary.

Yes, indeed. As for the active side, the RDMA CM doesn't have a BINDTODEVICE equivalent.

As for the original issue we were discussing here, Sean: the conclusion is that with upstream 2.6.35 bits, for the rdma connection to go from hca1 port1 to hca1 port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a neighbour, similarly to a ping -I ib0 to ib1's address. A neighbour isn't created unless the responding NIC (ib1 in my example) has both rp_filter set to 0 and accept_local set to 1. Jason, does this make sense?

Or.
Re: using same IP subnet on multiple interfaces
Jason Gunthorpe wrote:
>> As for the original issue we were discussing here, the conclusion is that with upstream 2.6.35 bits, for the rdma connection to go from hca1 port1 to hca1 port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a neighbour, similarly to a ping -I ib0 to ib1's address. A neighbour isn't created unless the responding NIC (ib1 in my example) has both rp_filter set to 0 and accept_local set to 1. Does this make sense?

> This description seemed reasonable to me. It is pretty confusing what binding means in RDMA CM: it is different than sockets, and is some combination of SO_BINDTODEVICE and bind to address.

I was thinking that one of the things taken care of by the patch set to addr.c/cma.c that you, David and Sean did last year was to make binding in rdma-cm a by-the-book bind to address. In what aspect is it different now?

Or.
Re: [PATCH v2] rdma/ib_pack.h: add new bth opcodes
Bob Pearson wrote:
> I was curious to see if I could force a ConnectX device to slow down from a remote application. But since the MADs have been crippled for IBOE there is no way to configure it.

QP1 MADs are working for ConnectX, e.g. the IB CM is fully functional for IBoE, and I don't think the MAD layer was modified to emulate MADs for the CM over a regular UD QP, UDP or the like. Eli, am I correct in that? For some reason the PMA (QP1 performance counters) service isn't exposed, but it should be working (and helpful) as well.

Or.
Re: net-next pull request: RDS
Hi Andy,

Some clarifications/questions from whatever quick look one can have over 107 patches...

Zach Brown's "RDS/IB: print IB event strings as well as their number" (commit 1bde04a63d532c2540d6fdee0a661530a62b1686 in net-next-2.6) looks like a perfect fit for a helper function in the core IB stack, where it could be used by other rdma drivers (e.g. ipoib, iser, srp, etc.).

Chris Mason's "rds: recycle FMRs through lockless lists" (6fa70da6081bbcf948801fd5ee0be4d98a43) adds net/rds/xlist.h. Isn't this something that would be better placed under include/linux/ etc.?

And last, your "RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()" patch looks interesting. First, it has some macros which could be placed in more general locations, e.g. pcidev_to_node and ibdev_to_node. Also, your "significantly helps performance" comment is interesting; I'll send a separate note about that to the rdma mailing list.

Or.
Re: RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()
Hi Andy,

Looking at this net-next-2.6 patch, I wonder if you can elaborate on your "significantly helps performance" comment: what improvement do you see with this patch? What about the QP/CQ memory, would it be better to place it node-local to the HCA as well?

Or.

commit e4c52c98e04937ea87b0979a81354d0040d284f9
Author: Andy Grover andy.gro...@oracle.com
Date: Fri Apr 23 10:49:53 2010 -0700

RDS/IB: add _to_node() macros for numa and use {k,v}malloc_node()

Allocate send/recv rings in memory that is node-local to the HCA. This significantly helps performance.

Signed-off-by: Andy Grover andy.gro...@oracle.com

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 7a2131d..7d289d7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -77,7 +77,7 @@ void rds_ib_add_one(struct ib_device *device)
 		goto free_attr;
 	}

-	rds_ibdev = kmalloc(sizeof *rds_ibdev, GFP_KERNEL);
+	rds_ibdev = kmalloc_node(sizeof *rds_ibdev, GFP_KERNEL, ibdev_to_node(device));
 	if (!rds_ibdev)
 		goto free_attr;

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c506604..4bc3e2f 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -3,6 +3,8 @@

 #include <rdma/ib_verbs.h>
 #include <rdma/rdma_cm.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
 #include "rds.h"
 #include "rdma_transport.h"

@@ -167,6 +169,10 @@ struct rds_ib_device {
 	spinlock_t spinlock; /* protect the above */
 };

+#define pcidev_to_node(pcidev) pcibus_to_node(pcidev->bus)
+#define ibdev_to_node(ibdev) pcidev_to_node(to_pci_dev(ibdev->dma_device))
+#define rdsibdev_to_node(rdsibdev) ibdev_to_node(rdsibdev->dev)
+
 /* bits for i_ack_flags */
 #define IB_ACK_IN_FLIGHT 0
 #define IB_ACK_REQUESTED 1
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 75eda9c..b5d0b60 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -347,7 +347,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		goto out;
 	}

-	ic->i_sends = vmalloc(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));
+	ic->i_sends = vmalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
+				   ibdev_to_node(dev));
 	if (!ic->i_sends) {
 		ret = -ENOMEM;
 		rdsdebug("send allocation failed\n");
@@ -355,7 +356,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	}
 	memset(ic->i_sends, 0, ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work));

-	ic->i_recvs = vmalloc(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work));
+	ic->i_recvs = vmalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
+				   ibdev_to_node(dev));
 	if (!ic->i_recvs) {
 		ret = -ENOMEM;
 		rdsdebug("recv allocation failed\n");
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 7315fff..cc341cd 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -297,7 +297,7 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev)
 		rds_ib_flush_mr_pool(pool, 0);
 	}

-	ibmr = kzalloc(sizeof(*ibmr), GFP_KERNEL);
+	ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL, rdsibdev_to_node(rds_ibdev));
 	if (!ibmr) {
 		err = -ENOMEM;
 		goto out_no_cigar;
@@ -376,7 +376,8 @@ static int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibm
 	if (page_cnt > fmr_message_size)
 		return -EINVAL;

-	dma_pages = kmalloc(sizeof(u64) * page_cnt, GFP_ATOMIC);
+	dma_pages = kmalloc_node(sizeof(u64) * page_cnt, GFP_ATOMIC,
+				 rdsibdev_to_node(rds_ibdev));
 	if (!dma_pages)
 		return -ENOMEM;
Re: [rds-devel] net-next pull request: RDS
Andrew Grover wrote:
> Once net-next gets pushed to mainline and Roland pulls from that, then we'll be in a good position to put these helpers where they should go, and change other ULPs to use them.

Andy, as Roland commented, you can push such helpers through Dave once Roland has reviewed them, in a similar manner to a situation where an iser patch is merged by the iscsi maintainer, etc.

If we're going to the review now... Roland, what's your take on the net-next-2.6 patch below, and also on its net-next-2.6 buddy "RDS/IB: print IB event strings as well as their number" (1bde04a63d532c2540d6fdee0a661530a62b1686)?

Or.

commit 59f740a6aeb2cde2f79fe0df38262d4c1ef35cd8
Author: Zach Brown zach.br...@oracle.com
Date: Tue Aug 3 13:52:47 2010 -0700

RDS/IB: print string constants in more places

This prints the constant identifier for work completion status and rdma cm event types, like we already do for IB event types. A core string array helper is added that each string type uses.

Signed-off-by: Zach Brown zach.br...@oracle.com

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 8e3886d..bb6ad81 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -40,6 +40,15 @@

 #include "rds.h"

+char *rds_str_array(char **array, size_t elements, size_t index)
+{
+	if ((index < elements) && array[index])
+		return array[index];
+	else
+		return "unknown";
+}
+EXPORT_SYMBOL(rds_str_array);
+
 /* this is just used for stats gathering :/ */
 static DEFINE_SPINLOCK(rds_sock_lock);
 static unsigned long rds_sock_count;
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 2189fd4..7ad3d57 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -345,6 +345,7 @@ u32 rds_ib_ring_completed(struct rds_ib_work_ring *ring, u32 wr_id, u32 oldest);
 extern wait_queue_head_t rds_ib_ring_empty_wait;

 /* ib_send.c */
+char *rds_ib_wc_status_str(enum ib_wc_status status);
 void rds_ib_xmit_complete(struct rds_connection *conn);
 int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
 		unsigned int hdr_off, unsigned int sg, unsigned int off);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0e2fea8..bc3dbc1 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -39,7 +39,8 @@
 #include "ib.h"

 static char *rds_ib_event_type_strings[] = {
-#define RDS_IB_EVENT_STRING(foo) [IB_EVENT_##foo] = __stringify(foo)
+#define RDS_IB_EVENT_STRING(foo) \
+		[IB_EVENT_##foo] = __stringify(IB_EVENT_##foo)
 	RDS_IB_EVENT_STRING(CQ_ERR),
 	RDS_IB_EVENT_STRING(QP_FATAL),
 	RDS_IB_EVENT_STRING(QP_REQ_ERR),
@@ -63,11 +64,8 @@ static char *rds_ib_event_type_strings[] = {

 static char *rds_ib_event_str(enum ib_event_type type)
 {
-	if (type < ARRAY_SIZE(rds_ib_event_type_strings) &&
-	    rds_ib_event_type_strings[type])
-		return rds_ib_event_type_strings[type];
-	else
-		return "unknown";
+	return rds_str_array(rds_ib_event_type_strings,
+			     ARRAY_SIZE(rds_ib_event_type_strings), type);
 };

 /*
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index a2f5f6f..e29e0ca 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -966,8 +966,9 @@ static inline void rds_poll_cq(struct rds_ib_connection *ic,
 	struct rds_ib_recv_work *recv;

 	while (ib_poll_cq(ic->i_recv_cq, 1, &wc) > 0) {
-		rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
-			 (unsigned long long)wc.wr_id, wc.status, wc.byte_len,
+		rdsdebug("wc wr_id 0x%llx status %u (%s) byte_len %u imm_data %u\n",
+			 (unsigned long long)wc.wr_id, wc.status,
+			 rds_ib_wc_status_str(wc.status), wc.byte_len,
 			 be32_to_cpu(wc.ex.imm_data));
 		rds_ib_stats_inc(s_ib_rx_cq_event);

@@ -985,10 +986,11 @@ static inline void rds_poll_cq(struct rds_ib_connection *ic,
 		} else {
 			/* We expect errors as the qp is drained during shutdown */
 			if (rds_conn_up(conn) || rds_conn_connecting(conn))
-				rds_ib_conn_error(conn, "recv completion on "
-						  "%pI4 had status %u, disconnecting and "
+				rds_ib_conn_error(conn, "recv completion on %pI4 had "
+						  "status %u (%s), disconnecting and "
 						  "reconnecting\n", &conn->c_faddr,
-						  wc.status);
+						  wc.status,
+						  rds_ib_wc_status_str(wc.status));
 		}

 		/*
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 15f7569..808544a 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -38,6 +38,40 @@
 #include "rds.h"
 #include
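For readers who want to play with the string-table idea outside the kernel, here is a userspace sketch of the same bounds-checked lookup that rds_str_array() performs in the patch above. The demo_* names and event values are made up for illustration and are not the real IB_EVENT_* constants; the const qualifiers are my own addition.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical event codes standing in for enum ib_event_type. */
enum demo_event {
	DEMO_EVENT_CQ_ERR = 0,
	DEMO_EVENT_QP_FATAL = 1,
	DEMO_EVENT_PORT_ACTIVE = 3,	/* value 2 is left unused on purpose */
};

/* Designated initializers leave every unlisted slot NULL. */
static const char *const demo_event_strings[] = {
	[DEMO_EVENT_CQ_ERR]      = "DEMO_EVENT_CQ_ERR",
	[DEMO_EVENT_QP_FATAL]    = "DEMO_EVENT_QP_FATAL",
	[DEMO_EVENT_PORT_ACTIVE] = "DEMO_EVENT_PORT_ACTIVE",
};

/*
 * Return array[index] if index is in range and the slot is populated,
 * otherwise the fixed string "unknown".
 */
const char *str_array(const char *const *array, size_t elements, size_t index)
{
	if (index < elements && array[index])
		return array[index];
	return "unknown";
}

/* Per-type wrapper, in the spirit of rds_ib_event_str(). */
const char *demo_event_str(unsigned int type)
{
	return str_array(demo_event_strings, ARRAY_SIZE(demo_event_strings), type);
}
```

The two checks cover different failure modes: the range check catches values beyond the table, and the NULL check catches gaps inside it, so a new enum value without a string degrades to "unknown" rather than a NULL dereference in a debug print.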
Re: gratuitous arps lost during IB switch failure
Sumeet sumeet.lahor...@oracle.com wrote:
> It turns out that this problem was being caused because we had multiple IPs configured on the bonded infiniband interface. It appears that grat. arps are being sent out for only one of those IPs. [...] Can the bonding driver be fixed to send out grat arps for both these IPs?

Is there anything that makes you think this issue has something to do with ipoib/bonding? Did you check with Ethernet? The bonding driver isn't maintained on linux-rdma but rather on netdev.

Or.
re: mlx4: propagate node_description changes down to FW
Hi Jack,

I just came across this patch of yours, which was placed in ofed 1.5.2. I didn't see any trace of it here on linux-rdma (any specific reason for that?). Some questions/issues to discuss:

First and foremost, for (say) a 1k node cluster, is it correct that for each node doing a start/restart of the openibd service a trap will be sent to opensm, and the latter will do a heavy sweep?! This doesn't sound very scalable... Have you tested it over large clusters? What was the impact?

Or.

mlx4: propagate node_description changes down to FW.

The Node Description cannot be changed via MADs (it is read-only). Until now, it was changed in the driver, and the new Node Description was simply overwritten by the driver on MAD responses. The node description was modified in the driver by openibd via sysfs.

However, that generated a race condition, where OpenSM could get the FW node description rather than the overwritten description if OpenSM queried the device before openibd had a chance to enter the new description.

The solution is a new FW command (SET_NODE) which allows passing the new node description to FW. When this command is invoked, FW issues a 144 trap to OpenSM. Upon receiving this trap, OpenSM initiates a heavy sweep, thus updating the node description properly -- and eliminating the race.

This patch works whether or not the new FW command is available. If SET_NODE is not available, things work as before.

Fixes FM82320

Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il

Index: ofed_kernel/drivers/infiniband/hw/mlx4/main.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/mlx4/main.c 2010-09-27 17:20:54.069787000 +0200
+++ ofed_kernel/drivers/infiniband/hw/mlx4/main.c 2010-09-27 17:21:15.07481 +0200
@@ -421,14 +421,34 @@ out:
 static int mlx4_ib_modify_device(struct ib_device *ibdev, int mask,
 				 struct ib_device_modify *props)
 {
+	struct mlx4_cmd_mailbox *mailbox;
+	int err;
+
 	if (mask & ~IB_DEVICE_MODIFY_NODE_DESC)
 		return -EOPNOTSUPP;

-	if (mask & IB_DEVICE_MODIFY_NODE_DESC) {
-		spin_lock(&to_mdev(ibdev)->sm_lock);
-		memcpy(ibdev->node_desc, props->node_desc, 64);
-		spin_unlock(&to_mdev(ibdev)->sm_lock);
-	}
+	if (!(mask & IB_DEVICE_MODIFY_NODE_DESC))
+		return 0;
+
+	spin_lock(&to_mdev(ibdev)->sm_lock);
+	memcpy(ibdev->node_desc, props->node_desc, 64);
+	spin_unlock(&to_mdev(ibdev)->sm_lock);
+
+	/* if possible, pass node desc to FW, so it can generate
+	 * a 144 trap. If cmd fails, just ignore.
+	 */
+	mailbox = mlx4_alloc_cmd_mailbox(to_mdev(ibdev)->dev);
+	if (IS_ERR(mailbox))
+		return 0;
+
+	memset(mailbox->buf, 0, 256);
+	memcpy(mailbox->buf, props->node_desc, 64);
+	err = mlx4_cmd(to_mdev(ibdev)->dev, mailbox->dma, 1, 0,
+		       MLX4_CMD_SET_NODE, MLX4_CMD_TIME_CLASS_A);
+	if (err)
+		mlx4_ib_dbg("SET_NODE command failed (%d)", err);
+
+	mlx4_free_cmd_mailbox(to_mdev(ibdev)->dev, mailbox);

 	return 0;
 }
Index: ofed_kernel/include/linux/mlx4/cmd.h
===
--- ofed_kernel.orig/include/linux/mlx4/cmd.h 2010-09-27 17:20:40.519054000 +0200
+++ ofed_kernel/include/linux/mlx4/cmd.h 2010-09-27 17:21:15.081799000 +0200
@@ -58,6 +58,7 @@ enum {
 	MLX4_CMD_SENSE_PORT = 0x4d,
 	MLX4_CMD_HW_HEALTH_CHECK = 0x50,
 	MLX4_CMD_SET_PORT = 0xc,
+	MLX4_CMD_SET_NODE = 0x5a,
 	MLX4_CMD_ACCESS_DDR = 0x2e,
 	MLX4_CMD_MAP_ICM = 0xffa,
 	MLX4_CMD_UNMAP_ICM = 0xff9,
Index: ofed_kernel/drivers/net/mlx4/cmd.c
===
--- ofed_kernel.orig/drivers/net/mlx4/cmd.c 2010-09-27 17:20:32.995814000 +0200
+++ ofed_kernel/drivers/net/mlx4/cmd.c 2010-09-27 17:21:15.088792000 +0200
@@ -242,8 +242,11 @@ static int mlx4_cmd_poll(struct mlx4_dev
 					  __raw_readl(hcr + HCR_OUT_PARAM_OFFSET + 4));
 	stat = be32_to_cpu((__force __be32) __raw_readl(hcr + HCR_STATUS_OFFSET)) >> 24;
 	err = mlx4_status_to_errno(stat);
-	if (err)
-		mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n", op, stat);
+	if (err) {
+		if (op != MLX4_CMD_SET_NODE || stat != CMD_STAT_BAD_OP)
+			mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n",
+				 op, stat);
+	}

 out:
 	up(&priv->cmd.poll_sem);
@@ -296,8 +299,9 @@ static int mlx4_cmd_wait(struct mlx4_dev
 	err = context->result;
 	if (err) {
-		mlx4_err(dev, "command 0x%x failed: fw status = 0x%x\n",
-			 op, context->fw_status);
+		if (op != MLX4_CMD_SET_NODE ||
Re: mlx4: propagate node_description changes down to FW
Jack Morgenstein wrote:
> I have not yet submitted the patch to the list.

Sounds like it's about time to do that... could you send this for review/merge into 2.6.37?

From what was commented here and from further looking, the sentence

  [...] Upon receiving this trap, OpenSM initiates a heavy sweep, thus updating the node description properly [...]

isn't accurate. I suggest changing it to something like

  Upon receiving this trap, OpenSM issues SubnGet(NodeDescription) to the node that sent the trap, thus updating the node description properly

Also, I guess "Fixes FM82320" isn't meaningful for the upstream change log...

Or.
Re: Maximum size for memory registration (ibv_reg_mr)
Eli Cohen wrote:
> If you create a MR in kernel, it covers the entire address space and the HCA does not pose any limit since you do not consume MTTs. And if you use MTTs then the page size is a parameter in this calculation - huge page, regular page etc.

I agree that the kernel case is not of large interest, even though what you wrote only applies to a DMA MR; when some FMR scheme is used, MTTs are consumed, of course. But typically kernel code will not go to the order of gigabytes, in other words it will not hit the HCA limit.

> So leaving it as is seems to be the most accurate thing...

I would implement it for regular pages and drop a note in the libibverbs man page saying that if huge pages are used (well, the huge pages patch set isn't fully merged, maybe it's about time to make this happen...) then the actual limit is bigger, i.e. it follows the ratio between the huge page size and the regular page size.

Or.
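To illustrate the proportion mentioned above, here is a back-of-the-envelope sketch. It assumes one MTT entry maps exactly one page, and the MTT count used in the usage note is invented, not a figure from any HCA; max_mr_bytes() is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Upper bound on registrable memory when MTTs are consumed, under the
 * assumption that one MTT entry maps exactly one page of page_size bytes.
 * Returns 0 if the product would overflow 64 bits.
 */
uint64_t max_mr_bytes(uint64_t mtt_entries, uint64_t page_size)
{
	if (page_size != 0 && mtt_entries > UINT64_MAX / page_size)
		return 0;
	return mtt_entries * page_size;
}
```

With a made-up budget of 2^20 MTT entries this gives 4 GiB for 4 KiB pages and 2 TiB for 2 MiB huge pages, i.e. the limit scales by the 512x ratio between the two page sizes.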
Re: [PATCHv10 0/11] IBoE support to Infiniband
Eli Cohen wrote:
> We have successfully tested MPI, SDP, RDS, and native Verbs applications over IBoE.

I came across your ofed commit e5414cccaa13e6dd80d8d6fc3dafe95355facdef, "sdp: module parameter to disable SDP over ROCEE", and wasn't sure what's behind it. Can you clarify that?

Or.
Re: [PATCHv10 0/11] IBoE support to Infiniband
Amir Vadai wrote:
> It is from the days that SDP over RoCE wasn't stable. In addition, customers had a very long delay before a TCP connection was established, in the following scenario:
> 1. in libsdp.conf, the mode is set to 'both' (try SDP and fall back to TCP)
> 2. the application tries to open a socket to a remote peer connected using 10G Ethernet
> 3. the remote host doesn't support RoCE
> It took a few seconds until the CMA gave up trying to connect, the SDP connection failed, and the TCP connection was established.

Thanks for the clarification re the remote host not supporting IBoE. Anyway, I don't see if/what this has to do with sdp stability; it's just a delay in connection establishment, and you say the default for this param is to be changed to off.

Or.
Re: [PATCH] mlx4: Limit num of fast reg WRs
Eli Cohen e...@dev.mellanox.co.il wrote:
> Fix the limit of max fast registration WRs that can be posted to CX to match hardware capabilities.

Guys, can you clarify whether the hardware limitation is 511 entries, or whether it's (PAGE_SIZE / sizeof(pointer)) - 1, which is 4096 / 8 - 1 = 511 but can change if the page size gets bigger or smaller?

Or.
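The arithmetic in the question can be written down as below. This is only a sketch of the second reading (a limit derived from the page size, with 8-byte entries); whether the hardware limit really scales this way is exactly what is being asked, and max_fast_reg_entries() is a name invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of 8-byte page-list entries that fit in one page, minus one,
 * i.e. (PAGE_SIZE / sizeof(u64)) - 1 from the question above.
 */
size_t max_fast_reg_entries(size_t page_size)
{
	return page_size / sizeof(uint64_t) - 1;
}
```

For a 4096-byte page this yields 511; on a 64 KiB page configuration the same formula would give 8191, which is why a fixed 511 and the formula are not interchangeable.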
Re: Work completions generated after a queue pair has made the transition to an error state
Bart Van Assche bvanass...@acm.org wrote:
> Has anyone been looking into this before ?

Nope, never ever. What HCA is that?

Or.
Re: Work completions generated after a queue pair has made the transition to an error state
Eli Cohen wrote:
> Completions with non-zero (error) status and a wr_id / opcode combination were received that were never queued by the application.

> In case of error the opcode of the completed operation is not provided. I am not sure why.

Eli, there's nothing in the IB spec that mandates the WC.opcode of an unsuccessful work request to be valid. The only WC fields that must be valid are the work-request ID (cookie) and the status code; I believe that hardware vendors would also make sure the vendor id is valid...

Bart, reading your initial posting, I was under the impression that the wr_id is something your app didn't post, so in that respect I take back my response. Of course, when you program to IB you can't assume anything about the WC.opcode of a WR that completed in error.

Or.

> Note: some work requests were queued with and some without the flag IB_SEND_SIGNALED. I'm not sure however whether that has anything to do with the observed behavior.

> If you have WRs for which you did not set IB_SEND_SIGNALED, they are not considered completed before a completion entry that corresponds to that send queue is pushed to the CQ. I am not sure if it means that all the WRs in the send queue should be completed with error.

> This behavior is easy to reproduce. If I interpret the InfiniBand Architecture Specification correctly, this behavior is non-compliant. Has anyone been looking into this before ?

> I haven't seen it. It isn't supposed to happen. What hardware and software are you using and how do you reproduce it?

> Hello Ralph and Or,
> The way I reproduce that behavior is by modifying the state of a queue pair into IB_QPS_ERR while RDMA is ongoing. The application, which is multithreaded, performs RDMA by calling ib_post_recv() and ib_post_send() (opcodes IB_WR_SEND, IB_WR_RDMA_READ and IB_WR_RDMA_WRITE). This has been observed with the mlx4 driver, a ConnectX HCA and firmware version 2.7.0.
> Bart.
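A small sketch of the consumer-side consequence: on an error completion, look the operation up by wr_id in the application's own bookkeeping rather than trusting the opcode field. The demo_* types below are simplified stand-ins defined only for this example; they are not the real ib_wc / ibv_wc definitions, and completed_opcode() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the verbs completion types (illustration only). */
enum demo_wc_status { DEMO_WC_SUCCESS = 0, DEMO_WC_WR_FLUSH_ERR = 5 };
enum demo_wc_opcode { DEMO_WC_SEND = 0, DEMO_WC_RDMA_WRITE = 1, DEMO_WC_RDMA_READ = 2 };

struct demo_wc {
	uint64_t wr_id;			/* always valid */
	enum demo_wc_status status;	/* always valid */
	enum demo_wc_opcode opcode;	/* only trusted when status is success */
};

/* What the application recorded when it posted the work request. */
struct demo_request {
	uint64_t wr_id;
	enum demo_wc_opcode opcode;
};

/*
 * Return the opcode of the completed request: the WC's own opcode on
 * success, the recorded one on error. Returns -1 when wr_id matches
 * nothing that was posted, which is the anomaly reported in this thread.
 */
int completed_opcode(const struct demo_wc *wc,
		     const struct demo_request *posted, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (posted[i].wr_id != wc->wr_id)
			continue;
		if (wc->status == DEMO_WC_SUCCESS)
			return (int)wc->opcode;
		return (int)posted[i].opcode;
	}
	return -1;
}
```

Keeping such a table also gives a cheap way to tell "opcode looks wrong on an error completion", which is allowed, apart from "wr_id was never posted", which would be the real bug.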
Re: [PATCHv10 08/12] mlx4: Add support for IBoE - address resolution
Eli Cohen e...@dev.mellanox.co.il wrote: [...] Address resolution is done atomically in the case of a link local address or a multicast GID and otherwise -EINVAL is returned. mlx4 transport packets were changed too to accommodate for IBoE. Multicast groups attach/detach calls dev_mc_add/remove to update the NIC's multicast filters. This change log, and I assume the patch as well, deals a lot with multicast. However, patch 0/10 says "With these patches, IBoE multicast frames may be broadcast as there is currently no use of a L2 multicast group membership protocol." Does this mean some/much of the code added/changed by this patch is dead code or not needed at this point? Or.
Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port
Eli Cohen wrote: On Sun, Oct 24, 2010 at 6:22 PM, Roland Dreier rdre...@cisco.com wrote: No you did not. It was there already but we never noticed before Yossi's patch. But AFAICT Yossi's patch (5eb620c8) went into 2.6.22 about 2.5 years ago... wasn't that already there way before the IBoE stuff started? I see... I think the reason it started failing comes from this portion of patch 8: I pulled/built/booted with the for-next branch of Roland's tree, and I can't get IB link for the node, I don't think this is my problem, since I'm on L2 IB and not Eth, but should this work with pre 2.7 firmware?! if not, maybe patch the mlx4 driver to print some error, # ibv_devinfo hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.6.818 node_guid: 0002:c903:0002:6be2 sys_image_guid: 0002:c903:0002:6be5 vendor_id: 0x02c9 vendor_part_id: 26418 hw_ver: 0xA0 board_id: MT_0A50110002 phys_port_cnt: 2 port: 1 state: PORT_INIT (2) max_mtu:2048 (4) active_mtu: 2048 (4) sm_lid: 0 port_lid: 0 port_lmc: 0x00 port: 2 state: PORT_INIT (2) max_mtu:2048 (4) active_mtu: 2048 (4) sm_lid: 0 port_lid: 0 port_lmc: 0x00 # dmesg mlx4_core: Mellanox ConnectX core driver v0.01 (May 1, 2007) mlx4_core: Initializing :0b:00.0 mlx4_core :0b:00.0: PCI INT A - GSI 30 (level, low) - IRQ 30 mlx4_core :0b:00.0: setting latency timer to 64 mlx4_core :0b:00.0: FW version 2.6.818 (cmd intf rev 3), max commands 16 mlx4_core :0b:00.0: Catastrophic error buffer at 0x1f020, size 0x10, BAR 0 mlx4_core :0b:00.0: FW size 385 KB mlx4_core :0b:00.0: Clear int @ f0058, BAR 0 mlx4_core :0b:00.0: Mapped 26 chunks/6168 KB for FW. 
mlx4_core :0b:00.0: BlueFlame available (reg size 512, regs/page 256) mlx4_core :0b:00.0: Base MM extensions: flags 0cc0, rsvd L_Key 0500 mlx4_core :0b:00.0: Max ICM size 4294967296 MB mlx4_core :0b:00.0: Max QPs: 16777216, reserved QPs: 64, entry size: 256 mlx4_core :0b:00.0: Max SRQs: 16777216, reserved SRQs: 64, entry size: 128 mlx4_core :0b:00.0: Max CQs: 16777216, reserved CQs: 128, entry size: 128 mlx4_core :0b:00.0: Max EQs: 512, reserved EQs: 4, entry size: 128 mlx4_core :0b:00.0: reserved MPTs: 16, reserved MTTs: 16 mlx4_core :0b:00.0: Max PDs: 8388608, reserved PDs: 4, reserved UARs: 1 mlx4_core :0b:00.0: Max QP/MCG: 8388608, reserved MGMs: 0 mlx4_core :0b:00.0: Max CQEs: 4194304, max WQEs: 16384, max SRQ WQEs: 16384 mlx4_core :0b:00.0: Local CA ACK delay: 15, max MTU: 4096, port width cap: 3 mlx4_core :0b:00.0: Max SQ desc size: 1008, max SQ S/G: 62 mlx4_core :0b:00.0: Max RQ desc size: 512, max RQ S/G: 32 mlx4_core :0b:00.0: Max GSO size: 131072 mlx4_core :0b:00.0: DEV_CAP flags: mlx4_core :0b:00.0: RC transport mlx4_core :0b:00.0: UC transport mlx4_core :0b:00.0: UD transport mlx4_core :0b:00.0: XRC transport mlx4_core :0b:00.0: FCoIB support mlx4_core :0b:00.0: SRQ support mlx4_core :0b:00.0: IPoIB checksum offload mlx4_core :0b:00.0: P_Key violation counter mlx4_core :0b:00.0: Q_Key violation counter mlx4_core :0b:00.0: Big LSO headers mlx4_core :0b:00.0: APM support mlx4_core :0b:00.0: Atomic ops support mlx4_core :0b:00.0: Address vector port checking support mlx4_core :0b:00.0: UD multicast support mlx4_core :0b:00.0: Router support mlx4_core :0b:00.0: IBoE support mlx4_core :0b:00.0: profile[ 0] ( CMPT): 2^26 entries @ 0x 0, size 0x 1 mlx4_core :0b:00.0: profile[ 1] (RDMARC): 2^21 entries @ 0x 1, size 0x 400 mlx4_core :0b:00.0: profile[ 2] ( MTT): 2^20 entries @ 0x 10400, size 0x 400 mlx4_core :0b:00.0: profile[ 3] (QP): 2^17 entries @ 0x 10800, size 0x 200 mlx4_core :0b:00.0: profile[ 4] ( ALTC): 2^17 entries @ 0x 10a00, size 0x80 mlx4_core 
:0b:00.0: profile[ 5] ( SRQ): 2^16 entries @ 0x 10a80, size 0x80 mlx4_core :0b:00.0: profile[ 6] (CQ): 2^16 entries @ 0x 10b00, size 0x80 mlx4_core :0b:00.0:
Re: can't get IB link with the for-next branch / IBoE patches (was mlx4: Fix unneeded return error...)
I pulled/built/booted with the for-next branch of Roland's tree, and I can't get IB link for the node, I don't think this is my problem, since I'm on L2 IB and not Eth, but should this work with pre 2.7 firmware?! if not, maybe patch the mlx4 driver to print some error, okay, I verified that with 2.6.36 this node gets IB link and IPoIB is working fine, so it must be something in or related to the for-next branch, I assume around the IBoE patches that touch mlx4, which makes this failure happen. With 2.6.36 I also see the awk: /etc/ofed/setup-mlx4.awk:6: (FILENAME=/etc/ofed/mlx4.conf FNR=21) fatal: cannot open file `/sbin/setup-mlx4' for reading (No such file or directory) warning when loading mlx4_ib, but it isn't disruptive in the sense that the node works fine, IB wise. Or.
Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port
On Mon, Oct 25, 2010 at 1:34 PM, Eli Cohen e...@dev.mellanox.co.il wrote: IBoE will not work with firmware prior to 2.7.000. I don't think an error message is required in this case. But I'm on **IB**, not IBoE. You don't mean that the Linux kernel IB stack is not functional over pre-2.7 firmware with the IBoE patches, do you?
Re: can't get IB link with the for-next branch / IBoE patches
On Mon, Oct 25, 2010 at 6:17 PM, Eli Cohen e...@dev.mellanox.co.il wrote: On Mon, Oct 25, 2010 at 06:36:43AM -0700, Roland Dreier wrote: I suspect I broke either the UD header packing or the build_mlx_header function when I cleaned up the patches. I see the same problem, I'll take a look today. I think this will fix things up. The + operator has precedence over the ? operator so we end up with packet_length equal to IB_GRH_BYTES / 4, which is wrong. Once you guys feel you have a fix, I would be happy to give Roland's for-next bits some further basic kernel testing (e.g. IB link up/down, IPoIB, running an SM on a node with the IBoE patches) and a bit of more advanced testing (e.g. IB/iSER, IB/RDS [Andy]) to see that things are in place with L2 IB. I would also recommend that the iWARP folks do the same, as the addr/rdma-cm modules were also modified. The merge window still has about 9 days, so we're okay with delaying the push by 1-2 days. Thoughts, people? Or.
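The precedence problem described above can be reproduced in isolation. The following is only a minimal sketch, not the actual build_mlx_header or UD header packing code: the function names are made up for illustration, and the header sizes (LRH 8, GRH 40, BTH 12, DETH 8 bytes) are assumed from the IB wire format rather than taken from this thread.

```c
#include <assert.h>

/* Header sizes in bytes; assumed values, not copied from the patch. */
enum {
	LRH_BYTES  = 8,
	GRH_BYTES  = 40,
	BTH_BYTES  = 12,
	DETH_BYTES = 8,
};

/*
 * Buggy form: '+' binds tighter than '?:', so the whole sum to the left
 * of '?' becomes the condition. That sum is never zero, so the result
 * is always GRH_BYTES / 4, whatever the payload size is.
 */
int packet_length_buggy(int grh_present, int payload_bytes)
{
	return (LRH_BYTES + BTH_BYTES + DETH_BYTES +
		grh_present ? GRH_BYTES : 0 +
		payload_bytes + 4 + 3) / 4;
}

/*
 * Fixed form: the conditional is parenthesized, so it only contributes
 * the optional GRH size to the sum. The result is in 4-byte words.
 */
int packet_length_fixed(int grh_present, int payload_bytes)
{
	assert(payload_bytes >= 0);
	return (LRH_BYTES + BTH_BYTES + DETH_BYTES +
		(grh_present ? GRH_BYTES : 0) +
		payload_bytes + 4 + 3) / 4;
}
```

With a 256-byte payload the buggy form yields 10 words regardless of grh_present, while the fixed form yields 72 without a GRH and 82 with one. Some compilers warn about the unparenthesized form, which is a cheap way to catch this class of bug.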
Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port
On Mon, Oct 25, 2010 at 4:36 PM, Eli Cohen e...@dev.mellanox.co.il wrote: Of course not. I just noticed that the IB link for IB link layer does come up, is that what you're seeing? No, I didn't have IB Link when I used the for-next bits
Re: [PATCH] mlx4: Fix unneeded return error in eth_link_query_port
On Mon, Oct 25, 2010 at 7:13 PM, Eli Cohen e...@dev.mellanox.co.il wrote: On Mon, Oct 25, 2010 at 06:46:39PM +0200, Or Gerlitz wrote: No, I didn't have IB Link when I used the for-next bits Can you summarize what is the problem that you're seeing? Eli, this is pretty simple, I do the following 1. pull/build/boot Roland's for-next 2. modprobe mlx4_ib -- the port state is INIT forever, is that clear? Or.
Re: can't get IB link with the for-next branch / IBoE patches
Roland Dreier wrote: Yep, looks like that's where my cleanup broke things. I rolled this in and pushed it out; I'm testing it myself now. My IB port comes to active now, I think that fixed things. same here, I have IB port coming to active and basic IPoIB, opensm working okay on the node with the current for-next/IBoE bits Or.
Re: can't get IB link with the for-next branch / IBoE patches
I have IB port coming to active and basic IPoIB, opensm working okay on the node with the current for-next/IBoE bits doing a little bit stress testing, I came across the below oops, when running IPoIB and couple of iperf/udp sessions, it doesn't look like a problem in the IB stack. Also with rds, using rds-stress from rds-tools-1.5-1.el5 and rds-stress -s 192.168.20.18 -p 4000 -t 1 -q 1K -a 1K -D 1M on the client side, the node running the for-next/IBoE bits and acting as the passive side of the test, got hanged. Also here, this could be a bug in RDS and not in the IBoE patches, I know that the rds guys queued about a hundred! patches for 2.6.37 so with these patches things might be better. I have the oops trace in jpg, will send to Andy, Roland and Eli. I guess we can continue these tests for 2-3 days and have the push over the weekend, or push it before and get fixes if needed through the -rc cycle. Oct 26 12:36:30 nsg2 kernel: BUG: spinlock bad magic on CPU#0, iperf/20845 Oct 26 12:36:30 nsg2 kernel: lock: 81663ef8, .magic: , .owner: none/-1, .owner_cpu: 0 Oct 26 12:36:30 nsg2 kernel: Pid: 20845, comm: iperf Not tainted 2.6.36-rc5-42052-gce806e1 #1 Oct 26 12:36:30 nsg2 kernel: Call Trace: Oct 26 12:36:30 nsg2 kernel: [811542b7] ? do_raw_spin_lock+0x22/0x122 Oct 26 12:36:30 nsg2 kernel: [81268b2b] ? dev_queue_xmit+0x10d/0x346 Oct 26 12:36:30 nsg2 kernel: [8128ca13] ? ip_push_pending_frames+0x2bf/0x318 Oct 26 12:36:30 nsg2 kernel: [812a7e66] ? udp_push_pending_frames+0x2d2/0x351 Oct 26 12:36:30 nsg2 kernel: [812a970c] ? udp_sendmsg+0x4b0/0x59c Oct 26 12:36:30 nsg2 kernel: [8112e9f7] ? cap_socket_sendmsg+0x0/0x3 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [8112e9f7] ? cap_socket_sendmsg+0x0/0x3 Oct 26 12:36:30 nsg2 kernel: [81256bbb] ? sock_aio_write+0xf5/0x10d Oct 26 12:36:30 nsg2 kernel: [810029ae] ? reschedule_interrupt+0xe/0x20 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? 
common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [810b9b49] ? do_sync_write+0xab/0xeb Oct 26 12:36:30 nsg2 kernel: [812e7abf] ? _raw_spin_unlock_irq+0x9/0xd Oct 26 12:36:30 nsg2 kernel: [8112e83f] ? security_file_permission+0x18/0x6b Oct 26 12:36:30 nsg2 kernel: [810ba1f7] ? vfs_write+0xbe/0x132 Oct 26 12:36:30 nsg2 kernel: [810ba754] ? sys_write+0x45/0x6e Oct 26 12:36:30 nsg2 kernel: [81001e6b] ? system_call_fastpath+0x16/0x1b
Re: can't get IB link with the for-next branch / IBoE patches
doing a little bit stress testing, I came across the below oops, when running IPoIB and couple of iperf/udp sessions, it doesn't look like a problem in the IB stack. To trigger this I run from client node the following iperf -uc 192.168.21.18 -l 64000 -t 72000 -i 1 -b 40g -d -P 4 where the server node (21.18 here) was the one that has the IBoE patches and got this oops Oct 26 12:36:30 nsg2 kernel: BUG: spinlock bad magic on CPU#0, iperf/20845 Oct 26 12:36:30 nsg2 kernel: lock: 81663ef8, .magic: , .owner: none/-1, .owner_cpu: 0 Oct 26 12:36:30 nsg2 kernel: Pid: 20845, comm: iperf Not tainted 2.6.36-rc5-42052-gce806e1 #1 Oct 26 12:36:30 nsg2 kernel: Call Trace: Oct 26 12:36:30 nsg2 kernel: [811542b7] ? do_raw_spin_lock+0x22/0x122 Oct 26 12:36:30 nsg2 kernel: [81268b2b] ? dev_queue_xmit+0x10d/0x346 Oct 26 12:36:30 nsg2 kernel: [8128ca13] ? ip_push_pending_frames+0x2bf/0x318 Oct 26 12:36:30 nsg2 kernel: [812a7e66] ? udp_push_pending_frames+0x2d2/0x351 Oct 26 12:36:30 nsg2 kernel: [812a970c] ? udp_sendmsg+0x4b0/0x59c Oct 26 12:36:30 nsg2 kernel: [8112e9f7] ? cap_socket_sendmsg+0x0/0x3 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [8112e9f7] ? cap_socket_sendmsg+0x0/0x3 Oct 26 12:36:30 nsg2 kernel: [81256bbb] ? sock_aio_write+0xf5/0x10d Oct 26 12:36:30 nsg2 kernel: [810029ae] ? reschedule_interrupt+0xe/0x20 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [812e7d8e] ? common_interrupt+0xe/0x13 Oct 26 12:36:30 nsg2 kernel: [810b9b49] ? do_sync_write+0xab/0xeb Oct 26 12:36:30 nsg2 kernel: [812e7abf] ? _raw_spin_unlock_irq+0x9/0xd Oct 26 12:36:30 nsg2 kernel: [8112e83f] ? security_file_permission+0x18/0x6b Oct 26 12:36:30 nsg2 kernel: [810ba1f7] ? vfs_write+0xbe/0x132 Oct 26 12:36:30 nsg2 kernel: [810ba754] ? sys_write+0x45/0x6e Oct 26 12:36:30 nsg2 kernel: [81001e6b] ? 
system_call_fastpath+0x16/0x1b
Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace
Hefty, Sean wrote: [...] an alternative goal of these patches is to allow ibacm and similar applications to detect and react to SA and CM timeouts. Hi Sean, As far as I understand, a CM timeout is an event, not a MAD... when referring to detecting/reacting to CM timeouts, did you mean detecting MADs like CM retries and reacting to them? Or.
Re: [PATCH V2 0/3] New RAW_PACKET QP type
Aleksey Senin wrote: The following patches add a new QP type named RAW_PACKET. Is there anything different in this patch set compared to V1 of https://patchwork.kernel.org/patch/110153 or is it just a repost? Or.
Re: [PATCH V2 0/3] New RAW_PACKET QP type
Steve Wise wrote: I'm working on similar code for Chelsio that will use these QPs. Will the TX flow require going into kernel space or will it be fully offloaded? Or.
re: Fix IPoIB to conform to ethtool definitions
Hi Eli, can this patch of yours which you placed in ofed be pushed upstream? Or. From 4237a1fbc1bae6bb86665f81cd93cfac37b216d2 Mon Sep 17 00:00:00 2001 From: Eli Cohen e...@mellanox.co.il Date: Wed, 3 Nov 2010 10:56:38 +0200 Subject: [PATCH] IPoIB: Fix IPoIB to conform to ethtool definitions Ethtool documentation states that when one of the parameters, rx_coalesce_usecs or rx_max_coalesced_frames, is set to zero while the other has a non-zero value, the non-zero parameter should still be operative. For example, if rx_max_coalesced_frames is set to zero while rx_coalesce_usecs is non-zero, the rate of events is limited to not exceed (1 / rx_coalesce_usecs). In the opposite case, an event is generated only after rx_max_coalesced_frames have arrived. The documentation also states that setting both to zero is invalid. Signed-off-by: Eli Cohen e...@mellanox.co.il --- drivers/infiniband/ulp/ipoib/ipoib_ethtool.c |7 +++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c index e9795f6..e602b7f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c @@ -78,6 +78,13 @@ static int ipoib_set_coalesce(struct net_device *dev, coal->rx_max_coalesced_frames > 0x) return -EINVAL; + if (coal->rx_max_coalesced_frames | coal->rx_coalesce_usecs) { + if (!coal->rx_max_coalesced_frames) + coal->rx_max_coalesced_frames = 0x; + else if (!coal->rx_coalesce_usecs) + coal->rx_coalesce_usecs = 0x; + } + ret = ib_modify_cq(priv->recv_cq, coal->rx_max_coalesced_frames, coal->rx_coalesce_usecs); if (ret && ret != -ENOSYS) { -- 1.7.3.2
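The rule the patch applies can be sketched outside the driver as follows. This is an illustration under assumptions, not the ipoib code: the struct and function names are hypothetical, and COAL_MAX stands in for the upper limit, whose hex digits are lost in the archived copy of the patch; 0xffff is assumed here only so the sketch compiles and runs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed upper limit for both parameters (hypothetical stand-in). */
#define COAL_MAX 0xffffu

/* Hypothetical stand-in for the two RX fields of struct ethtool_coalesce. */
struct rx_coalesce {
	unsigned int usecs;      /* rx_coalesce_usecs */
	unsigned int max_frames; /* rx_max_coalesced_frames */
};

/*
 * If exactly one parameter is zero, raise it to the maximum so that the
 * other, non-zero parameter is the one that actually limits events.
 * Both-zero is left untouched here, as in the patch.
 */
int normalize_rx_coalesce(struct rx_coalesce *c)
{
	assert(c != NULL);

	if (c->usecs > COAL_MAX || c->max_frames > COAL_MAX)
		return -EINVAL;

	if (c->max_frames | c->usecs) {
		if (!c->max_frames)
			c->max_frames = COAL_MAX; /* timer alone paces events */
		else if (!c->usecs)
			c->usecs = COAL_MAX;      /* frame count alone paces events */
	}
	return 0;
}
```

Raising the zero field to the maximum, rather than passing zero to the hardware, makes the field that was set the one that fires first, which is the behavior the commit message describes.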
IBoE fixes/enhancements
Hi Eli, are there known IBoE fixes which are in ofed but missed 2.6.37-rc1? Also, can the below and/or any other enhancements you've placed in ofed be pushed upstream? it would be great to have perf counters operating fine for IBoE Or. From 72c316b60f62401e031520fe3f55ec6879bbc42b Mon Sep 17 00:00:00 2001 From: Eli Cohen e...@mellanox.co.il Date: Wed, 6 Jan 2010 14:09:38 +0200 Subject: [PATCH 12/12] mlx4: add support for reading performance counters This patch uses basic or extended counters which can be read by a command interface, to report counters for all the QPs that work on an rdmaoe port. This effectively allows to implement performance counter ala IB. Signed-off-by: Eli Cohen e...@mellanox.co.il --- drivers/infiniband/hw/mlx4/mad.c | 86 - drivers/infiniband/hw/mlx4/main.c| 17 ++- drivers/infiniband/hw/mlx4/mlx4_ib.h |1 + drivers/infiniband/hw/mlx4/qp.c |2 + drivers/net/mlx4/fw.h|1 - drivers/net/mlx4/main.c | 22 +++-- include/linux/mlx4/cmd.h |4 ++ include/linux/mlx4/device.h | 36 ++ 8 files changed, 159 insertions(+), 10 deletions(-) Index: ofed_kernel-fixes/drivers/infiniband/hw/mlx4/mad.c === --- ofed_kernel-fixes.orig/drivers/infiniband/hw/mlx4/mad.c 2010-09-01 15:30:01.0 +0300 +++ ofed_kernel-fixes/drivers/infiniband/hw/mlx4/mad.c 2010-09-01 15:33:48.571462204 +0300 @@ -229,9 +229,9 @@ static void forward_trap(struct mlx4_ib_ } } -int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,u8 port_num, - struct ib_wc *in_wc, struct ib_grh *in_grh, - struct ib_mad *in_mad, struct ib_mad *out_mad) +static int ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) { u16 slid, prev_lid = 0; int err; @@ -299,6 +299,87 @@ int mlx4_ib_process_mad(struct ib_device return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; } +static __be32 be64_to_be32(__be64 b64) +{ + return cpu_to_be32(be64_to_cpu(b64) 0x); +} + +static void edit_counters(struct 
mlx4_counters *cnt, void *data) +{ + *(__be32 *)(data + 40 + 24) = be64_to_be32(cnt-tx_bytes); + *(__be32 *)(data + 40 + 28) = be64_to_be32(cnt-rx_bytes); + *(__be32 *)(data + 40 + 32) = be64_to_be32(cnt-tx_frames); + *(__be32 *)(data + 40 + 36) = be64_to_be32(cnt-rx_frames); +} + +static void edit_ext_counters(struct mlx4_counters_ext *cnt, void *data) +{ + *(__be32 *)(data + 40 + 24) = be64_to_be32(cnt-tx_uni_bytes); + *(__be32 *)(data + 40 + 28) = be64_to_be32(cnt-rx_uni_bytes); + *(__be32 *)(data + 40 + 32) = be64_to_be32(cnt-tx_uni_frames); + *(__be32 *)(data + 40 + 36) = be64_to_be32(cnt-rx_uni_frames); + *(__be32 *)(data + 40 + 8) = be64_to_be32(cnt-rx_err_frames); +} + +static int rdmaoe_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, struct ib_mad *out_mad) +{ + struct mlx4_cmd_mailbox *mailbox; + struct mlx4_ib_dev *dev = to_mdev(ibdev); + int err; + u32 inmod = dev-counters[port_num - 1] 0x; + int mode; + +if (in_mad-mad_hdr.mgmt_class != IB_MGMT_CLASS_PERF_MGMT) + return -EINVAL; + + mailbox = mlx4_alloc_cmd_mailbox(dev-dev); + if (IS_ERR(mailbox)) + return IB_MAD_RESULT_FAILURE; + + err = mlx4_cmd_box(dev-dev, 0, mailbox-dma, inmod, 0, + MLX4_CMD_QUERY_IF_STAT, MLX4_CMD_TIME_CLASS_C); + if (err) + err = IB_MAD_RESULT_FAILURE; + else { + memset(out_mad-data, 0, sizeof out_mad-data); + mode = be32_to_cpu(((struct mlx4_counters *)mailbox-buf)-counter_mode) 0xf; + switch (mode) { + case 0: + edit_counters(mailbox-buf, out_mad-data); + err = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; + break; + case 1: + edit_ext_counters(mailbox-buf, out_mad-data); + err = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY; + break; + default: + err = IB_MAD_RESULT_FAILURE; + } + } + + mlx4_free_cmd_mailbox(dev-dev, mailbox); + + return err; +} + +int mlx4_ib_process_mad(struct ib_device *ibdev, int mad_flags,u8 port_num, + struct ib_wc *in_wc, struct ib_grh *in_grh, + struct ib_mad *in_mad, 
struct ib_mad *out_mad) +{ + switch
Re: Fix IPoIB to conform to ethtool definitions
Eli Cohen wrote: Sure, I was going to. I will send later today. I saw that you've dropped an implementation of inline/blue-flame sending for kernel space. What was the motivation? Is it SDP, RDS or the like, or something else? Or.
Re: IBoE fixes/enhancements
Eli Cohen wrote: I was going to send [...] upstream Also you had a fix to the port speed and something related to SL which I didn't understand, please send for review Or.
Re: kernel inline sending / UARs / etc
Eli Cohen wrote: The idea is to let kernel consumers enjoy the improvements to latency that blue flame gives. And yes, SDP is motivating us but I am going to push to IPoIB too. From my recollection of numbers, for user space apps, using inline accounts for about 1us improvement in the latency. If this is indeed the case, I'm sure if there's great value here for kernel consumers; do you have any numbers to support this patch? I want to take the opportunity that you raised the issue to hear others' opinions about changing the bitmap allocator to maintain an avail variable that will count the number of available UARs. I want to use this to limit the number of UARs that a kernel consumer can allocate so that there will always be some available for userspace. Is it correct that today all the kernel QPs use the same UAR? Any problem with that? Or.
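The allocator change Eli proposes, a bitmap that also tracks how many entries are still free so that kernel consumers can be refused while a reserve remains for userspace, could look roughly like this. It is a sketch only: the names, the 64-entry size and the user_reserve policy are assumptions made for illustration and are not taken from the mlx4 bitmap code.

```c
#include <assert.h>
#include <stdint.h>

#define UAR_COUNT 64 /* hypothetical pool size, fits one 64-bit word */

struct uar_bitmap {
	uint64_t used;             /* bit i set => UAR i is allocated */
	unsigned int avail;        /* number of UARs still free */
	unsigned int user_reserve; /* UARs kernel consumers must leave free */
};

void uar_bitmap_init(struct uar_bitmap *b, unsigned int user_reserve)
{
	assert(user_reserve <= UAR_COUNT);
	b->used = 0;
	b->avail = UAR_COUNT;
	b->user_reserve = user_reserve;
}

/* Returns the allocated index, or -1 if the caller may not allocate. */
int uar_alloc(struct uar_bitmap *b, int for_kernel)
{
	unsigned int floor = for_kernel ? b->user_reserve : 0;
	int i;

	/* Kernel callers are refused once only the reserve is left. */
	if (b->avail <= floor)
		return -1;

	for (i = 0; i < UAR_COUNT; i++) {
		if (!(b->used & (UINT64_C(1) << i))) {
			b->used |= UINT64_C(1) << i;
			b->avail--;
			return i;
		}
	}
	return -1;
}

void uar_free(struct uar_bitmap *b, int idx)
{
	assert(idx >= 0 && idx < UAR_COUNT);
	assert(b->used & (UINT64_C(1) << idx));
	b->used &= ~(UINT64_C(1) << idx);
	b->avail++;
}
```

Keeping the count next to the bitmap makes the limit check O(1); a real implementation would also need locking, which is left out here.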
Re: kernel inline sending / UARs / etc
Or Gerlitz wrote: Eli Cohen wrote: From my recollection of numbers, for user space apps, using inline accounts for about 1us improvement in the latency, if this is indeed the case, I'm sure if there's great value here for kernel consumers, do you have any numbers to support this patch? I wanted to say that I'm NOT sure, sorry for the spam Or.
Re: kernel inline sending / UARs / etc
Eli Cohen wrote: It indeed improves SDP's latency - I don't have exact numbers. the SDP number is very interesting (Amir, do you have it?) but irrelevant for upstream, any IPoIB numbers? Or.
Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace
Hefty, Sean wrote: CM mads aren't reliable, however they are retried. If a CM REQ does not receive a response after so many retries (usually 15), the REQ fails (status is timeout). The mad layer reports the timeout to the cm module. With snooping in place, a user will be notified that a mad send has failed and be given a copy of the mad. mmm, got that - I also see that ib_mad_send_wc has both the status and the content of the mad, upon which you base the design 3. ibacm returns a path record. The path record _may_ have come from cached data. 4. The librdmacm tries to establish a connection. 5. The kernel ib_cm module issues REQ. 6. The ib_mad module retries the REQ until it times out. 7. The mad timeout is reported to any users wishing to capture errors. In this example, the ibacm service would be registered and receive a copy of the failed REQ. The ibacm can look at the data in the REQ, see if it has cached path record data which matches, and remove the cached data if so. 8. The librdmacm will see a connection failure. So the usage of mad snooping would be for cache invalidation. I wonder whether registering for GID/MGID IN/OUT traps would be sufficient for the same purpose? Or.
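Step 7 above, dropping cached path data that matches a REQ which timed out, might be sketched as below. Everything here is hypothetical: the cache layout, the key fields and the function names are invented for illustration and do not reflect ibacm's actual data structures or the snooping interface under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 4 /* hypothetical, tiny on purpose */

/* Hypothetical key: fields a failed REQ would let the service match on. */
struct path_key {
	uint8_t  sgid[16];
	uint8_t  dgid[16];
	uint16_t pkey;
};

struct path_cache {
	struct path_key key[CACHE_SLOTS];
	int valid[CACHE_SLOTS];
};

int path_key_equal(const struct path_key *a, const struct path_key *b)
{
	return memcmp(a->sgid, b->sgid, sizeof(a->sgid)) == 0 &&
	       memcmp(a->dgid, b->dgid, sizeof(a->dgid)) == 0 &&
	       a->pkey == b->pkey;
}

/*
 * Called when a snooped REQ is reported with a timeout status.
 * Invalidates every cached entry matching the failed REQ's path and
 * returns how many entries were dropped.
 */
int path_cache_on_req_timeout(struct path_cache *c,
			      const struct path_key *failed)
{
	int i, dropped = 0;

	assert(c != NULL && failed != NULL);
	for (i = 0; i < CACHE_SLOTS; i++) {
		if (c->valid[i] && path_key_equal(&c->key[i], failed)) {
			c->valid[i] = 0;
			dropped++;
		}
	}
	return dropped;
}
```

The next resolution for the same destination would then miss the cache and go back to whatever resolution protocol the service uses, which is the effect described in the thread.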
Re: kernel inline sending / UARs / etc
Eli Cohen wrote: For IPoIB it gives ~1 usec of improvement in latency. yep, this is what I expected, so on your testbed, from what value to what value? It would also be important to note the change in CPU utilization (e.g. a few vmstat 1 output lines before/after the change, while running IPoIB traffic) Or.
Re: [PATCH 0/13] IB-mgmt: Port madeye to userspace
Hefty, Sean wrote: That requires registration with the SA. The intent is to avoid using a centralized service when possible. yep, makes sense, looks like this design finally went the decentralized way... cool Or.
Re: ib receive completion error
Usha Srinivasan wrote: Can someone from Mellanox tell me what the vendor error 0x32 means? I am getting this error for wc.opcode 128 (IB_WC_RECV) wc.status 4 (IB_WC_LOC_PROT_ERR). I am running ofed 1.5.2 and am getting it on both rhel5 and sles11 You can't count on the wc.opcode when the status isn't success (0), and yes, we also saw tons of ib0: failed recv event (status=4, wrid=154 vend_err 32) errors when running iscsi/tcp over IPoIB stress from a windows client to a linux node acting as the iscsi target and using ofed 1.5.2. It starts to look like a DoS attack is possible here. Or.
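One way an application can cope with the opcode being unreliable on error completions is to remember, per wr_id, what it posted, and consult that record instead of wc.opcode when the status is not success. The sketch below uses a stand-in struct rather than the real verbs types; the names and the table-indexed-by-wr_id scheme are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_SUCCESS 0

/* Minimal stand-in for a work completion (not the real struct ib_wc). */
struct wc {
	uint64_t wr_id;      /* always valid */
	int      status;     /* always valid */
	int      opcode;     /* only meaningful when status == WC_SUCCESS */
	uint32_t vendor_err;
};

/*
 * Returns the opcode the application should act on. On success the
 * completion's own opcode is used. On error the opcode recorded at
 * post time (posted_opcode[wr_id]) is used instead, and -1 is returned
 * if the wr_id was never posted by this application.
 */
int completed_opcode(const struct wc *wc, const int *posted_opcode,
		     size_t n_posted)
{
	assert(wc != NULL && posted_opcode != NULL);

	if (wc->status == WC_SUCCESS)
		return wc->opcode;
	if (wc->wr_id >= n_posted)
		return -1;
	return posted_opcode[wc->wr_id];
}
```

A return of -1 on an error completion corresponds to the situation Bart reported earlier in this archive: a wr_id the application does not recognize as its own.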
Re: [PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API
Tom Tucker t...@ogc.us wrote: This patch changes the bus mapping logic to avoid page_address() where necessary Hi Tom, Does "where necessary" mean that the invocations of page_address which remain in the code after this patch is applied are safe and need no kmap call? Or.
Re: [PATCH 3/3] ibacm: check for special handling of loopback requests
Ralph Campbell wrote: I guess what I'm objecting to is hard coding mlx4. I was trying to think of a way that would allow other HCAs to support the block loopback option in the future. It looks like ipoib sets IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK for kernel QPs but this isn't defined in libibverbs yet. It seems reasonable to add that feature some time in the future and change ibacm to use it. In the mean time, I guess I don't see an alternative to your patch. Ralph, Sean, It's been a couple of years since some folks from Voltaire tried to push this flag and the grounds for adding similar flags for QP creation. On the bright side, it's there for kernel consumers, where the existing flags are LSO and block-multicast-loopback. On the somewhat disappointing side, we didn't get that merged for user space. Basically, there was a claim of a dependency on the XRC patch set, which also added flags for QPs. At some point, Ron Livne managed to introduce a patch set which is independent of XRC, see (*) below, but it didn't get in. As such, one of our application teams pushed to ofed the mlx4 patch which sets this bit by default, and the acm code is trying to identify and act upon its existence (**) Or. 
(*) latest post of the patch set 0/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4392994 1/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393054 2/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393004 3/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393014 4/4: https://kerneltrap.org/mailarchive/openfabrics-general/2008/12/11/4393024 (**) ofed patch http://git.openfabrics.org/git?p=ofed_1_5/linux-2.6.git;a=blob;f=kernel_patches/fixes/mlx4_0290_mcast_loopback.patch;h=786a3926529befac2c2d1fa6d8c36bada79d61a7;hb=HEAD
Re: [PATCH 3/3] ibacm: check for special handling of loopback requests
Hefty, Sean wrote: One could argue that this change is reasonable regardless of the OFED kernel patch. It avoids sending multicast traffic when the destination is local. The main drawback beyond the extra code is that a node can't send a multicast message to itself, with the intent that remote listeners will be able to cache the address data. Sean, To be precise, the bit avoids receiving multicast packets by the QP that --sent-- them, not by other QPs subscribed to that group on the same node/hca; the patch change-log even states that "Inter QP multicast packets on the relevant HCA will still be delivered". You can test that with two mckey processes running on a node which has the patch active. So with acm the functionality you need is for the same QP to receive the packets it sent? Or.
rdma_lat woes
Hi Ido,

We ran into a situation where running rdma_lat with vs. without the -c flag (that is, with or without the rdma-cm) introduces a notable ~1us difference in latency for 1k messages: ~3us without the rdma-cm and 3.9us with it. I have reproduced that now with the latest code from your git tree and also with the RHEL-provided package perftest-1.2.3-1.el5, see the results below.

Also, your tree is not available through the ofa git web service. Vlad, can you help set this up?

Now, Jack, using this patch,

Index: perftest/rdma_lat.c
===================================================================
--- perftest.orig/rdma_lat.c
+++ perftest/rdma_lat.c
@@ -666,7 +666,7 @@ static int pp_connect_ctx(struct pingpon
 {
 	struct ibv_qp_attr attr = {
 		.qp_state = IBV_QPS_RTR,
-		.path_mtu = IBV_MTU_256,
+		.path_mtu = IBV_MTU_2048,
 		.dest_qp_num = data->rem_dest->qpn,
 		.rq_psn = data->rem_dest->psn,
 		.max_dest_rd_atomic = 1,

I could get rdma_lat without the rdma-cm (which means setting all the low-level QP params by hand) to produce the SAME result of 3.9us as with the rdma-cm. As you can see, it's a one-liner patch which uses the higher MTU of 2048 instead of the MTU of 256 hard-coded in the code.

This is quite counter-intuitive for packets whose size is 256, correct? Is there any known issue that can explain that?! The SA is convinced that 2048 (0x84) is the best MTU for that path; both nodes have ConnectX DDR with firmware 2.7.0.

Or.
# saquery -p --src-to-dst 1:14
Path record for 1 -> 14
PathRecord dump:
	service_id..............0x
	dgid....................fe80::8:f104:399:3c91
	sgid....................fe80::2:c903:2:6be3
	dlid....................0xE
	slid....................0x1
	hop_flow_raw............0x0
	tclass..................0x0
	num_path_revers.........0x80
	pkey....................0x
	qos_class...............0x0
	sl......................0x0
	mtu.....................0x84
	rate....................0x86
	pkt_life................0x92
	preference..............0x0
	resv2...................0x0
	resv3...................0x0

before the patch

active side, w.o rdma-cm
# rdma_lat 192.168.20.15 -s 1024 -n 1
26113:pp_client_connect: Couldn't connect to 192.168.20.15:18515
[r...@nsg1 ~]# rdma_lat 192.168.20.15 -s 1024 -n 1
local address: LID 0x0e QPN 0x1c004d PSN 0x3a3dca RKey 0x48002600 VAddr 0x0008a71400
remote address: LID 0x04 QPN 0x20004c PSN 0x27973 RKey 0x50042700 VAddr 0x001b724400
Latency typical: 3.01932 usec
Latency best : 2.97582 usec
Latency worst : 11.3183 usec

passive side, w.o rdma-cm
# rdma_lat -s 1024 -n 1
local address: LID 0x04 QPN 0x20004c PSN 0x27973 RKey 0x50042700 VAddr 0x001b724400
remote address: LID 0x0e QPN 0x1c004d PSN 0x3a3dca RKey 0x48002600 VAddr 0x0008a71400
Latency typical: 3.02386 usec
Latency best : 2.97436 usec
Latency worst : 6.63569 usec

active side, with rdma-cm
# rdma_lat 192.168.20.15 -s 1024 -n 1 -c
26133: Local address: LID , QPN 00, PSN 0xa12538 RKey 0x50002600 VAddr 0x0013d27400
26133: Remote address: LID , QPN 00, PSN 0x5c01e8, RKey 0x58042700 VAddr 0x0006dbb400
Latency typical: 3.89977 usec
Latency best : 3.83227 usec
Latency worst : 13.6462 usec

passive side, with rdma-cm
# rdma_lat -s 1024 -n 1 -c
21826: Local address: LID , QPN 00, PSN 0x5c01e8 RKey 0x58042700 VAddr 0x0006dbb400
21826: Remote address: LID , QPN 00, PSN 0xa12538, RKey 0x50002600 VAddr 0x0013d27400
Latency typical: 3.89982 usec
Latency best : 3.83082 usec
Latency worst : 13.6974 usec

after the patch, the result w.o -c and with MTU=2048 becomes 3.9us as well

/home/ogerlitz/linux/tools/perftest/rdma_lat 192.168.20.15 -s 1024 -n 1
local address: LID 0x0e QPN 0x3c004d PSN 0x14ff1e RKey 0x68002600 VAddr 0x0016c5d400
remote address: LID 0x04 QPN 0x40004c PSN 0xba137e RKey 0x70042700 VAddr 0x001f259400
Latency typical: 3.88327 usec
Latency best : 3.80378 usec
Latency worst : 8.27951 usec