IB/iSER major problems with Linux 3.0 and Solaris targets
Hi list, hi Or, at the moment we use our kernel 3.0 ported OFA kernel modules from OFED-1.5.4 to make iSER work together with our kernel 3.0 ported open-iscsi 2.0.869 and Solaris 11 COMSTAR targets. Now, we've got the problem that ib_iser from in there ignores its debug_level parameter. /sys/module/ib_iser/parameters/debug_level shows 0 but it spams the whole kernel log with debug messages. Seems to be a code bug. We've also tested a 3.0.15 mainline kernel with in-tree IB modules together with the OFED-1.5.4 user-space and this has much better IPoIB performance than the kernel stuff from OFED. So, we want to use them instead, but there is the same problem with the iSER debug messages and iSER doesn't work together with our Solaris 11 COMSTAR targets. We've tested this with open-iscsi (2.0.872, commit 4323e342d2c9fb8ed7233ce855001c189ec55b23) user-space. TCP is O.K. but iSER reports an error while login attempt: iscsiadm: initiator reported error (11 - iSCSI PDU timed out) The sent PDUs from iscsid debugging are the same, but there is an IO page fault in the kernel log. I've attached the relevant part and the iscsid log. This looks interesting: iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57 Or, could you please investigate/explain? It is a pain that we need both: working iSER and IPoIB traffic with good performance. Cheers, Sebastian On 19/12/11 10:14, Sebastian Riemer wrote: Hi list, I've already sent this to the open-iscsi mailing list, but I guess this is more relevant for linux-rdma. Finally I've got IB/iSER running on Debian Squeeze with Linux kernel 3.0 smoothly. The problem was that we did not have the suitable OFED for our kernel and we did not use the open-iscsi from OFED. Kernel 3.0 is supported since OFED-1.5.4 from 2011-12-05. So, I've taken the 1.5.2-based stuff from Debian/Experimental and I've updated it to 1.5.4 from OFA. Then, I've noticed that Debian doesn't build ib_iser in the OFA kernel source and that they don't build the open-iscsi kernel/user-space code - I made it do so. The next problem was that open-iscsi kernel code in OFED-1.5.4 is for = 2.6.32 based RedHat distributions. I had to port the source from 2.6.30 to 3.0 due to kernel API changes. OFA even forgot libiscsi_tcp.[ch] in OFED-1.5.4. So, I had to import it from 2.6.30 mainline. I did so, because we wanted to compare TCP and iSER speed over InfiniBand. Our Solaris COMSTAR targets provide both. After fixing the kernel, there was still a problem in the open-iscsi 2.0.869 user-space from OFED. Some sysfs magic has changed - so that the iSCSI host number couldn't be found. After fixing that, it worked for me. Cheers, Sebastian Jan 11 12:53:25 pserver214 kernel: [ 716.518372] SCSI subsystem initialized Jan 11 12:53:25 pserver214 kernel: [ 716.521146] Loading iSCSI transport class v2.0-870. Jan 11 12:53:25 pserver214 kernel: [ 716.528756] iscsi: registered transport (tcp) Jan 11 12:53:30 pserver214 kernel: [ 721.903544] iscsi: registered transport (iser) Jan 11 12:54:46 pserver214 kernel: [ 797.537439] iser: iser_connect:connecting to: 10.1.24.204, port 0xbc0c Jan 11 12:54:46 pserver214 kernel: [ 797.563158] iser: iser_cma_handler:event 0 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.566402] iser: iser_cma_handler:event 2 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.579704] iser: iser_create_ib_conn_res:setting conn 880807b17a80 cma_id 880807594400: fmr_pool 88082426b400 qp 8807ed22aa00 Jan 11 12:54:46 pserver214 kernel: [ 797.586557] iser: iser_cma_handler:event 9 status 0 conn 880807b17a80 id 880807594400 Jan 11 12:54:46 pserver214 kernel: [ 797.787932] iser: iscsi_iser_ep_poll:ib conn 880807b17a80 rc = 1 Jan 11 12:54:46 pserver214 kernel: [ 797.788137] scsi0 : iSCSI Initiator over iSER, v.0.1 Jan 11 12:54:46 pserver214 kernel: [ 797.794249] iser: iscsi_iser_conn_bind:binding iscsi/iser conn 8808058deab8 8808058decc8 to ib_conn 880807b17a80 Jan 11 12:54:46 pserver214 kernel: [ 797.794710] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488000 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.794919] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488200 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.794998] iser: iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57 Jan 11 12:54:46 pserver214 kernel: [ 797.795006] connection1:0: detected conn error (1011) Jan 11 12:54:46 pserver214 kernel: [ 797.795338] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488100 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.795535] AMD-Vi: Event logged [IO_PAGE_FAULT device=03:00.0 domain=0x0013 address=0x06488040 flags=0x0050] Jan 11 12:54:46 pserver214 kernel: [ 797.795730] AMD-Vi:
Re: Send with immediate data completion
On Jan 3, 2012, at 12:35 PM, Atchley, Scott wrote: On Jan 3, 2012, at 11:55 AM, Hefty, Sean wrote: I have a question about a completion for a send with immediate data. The IB spec (1.2.1) only mentions that the WC's immediate data be present at the receiver. It is silent on the value on the sender at completion. It does say that it is only valid if the WC's immediate data indicator is set. Can you provide a section reference to the spec on the areas that you're looking at? Looking quickly, section 11.4.2.1 reads like immediate data should be available in either case. I've never checked imm data on the send wc. I'm just trying to determine if there's an issue in the spec that should be addressed, or if this is simply a bug in the hca/driver. There is the definition in the glossary: Immediate Data Data contained in a Work Queue Element that is sent along with the payload to the remote Channel Adapter and placed in a Receive Work Completion. Section 3.7.4 Transport Layer: The Immediate Data (IMMDT) field is optionally present in RDMA WRITE and SEND messages. It contains data that the consumer placed in the Send or RDMA Write request and the receiving QP will place that value in the current receive WQE. An RDMA Write with immediate data will consume a receive WQE even though the QP did not place any data into the receive buffer since the IMMDT is placed in a CQE that references the receive WQE and indicates that the WQE has completed. Section 11.4.1.1 Post Send Request has: Immediate Data Indicator. This is set if Immediate Data is to be included in the outgoing request. Valid only for Send or Write RDMA operations. 4-byte Immediate Data. Valid only for Send or Write RDMA operations. 11.4.2.1 Poll for Completion Immediate data indicator. This is set if immediate data is present. 4-byte immediate data. None specifically mention the sender's completion event. Sean, Any thoughts? Personally, I would like to have it in the send completion, but it might not be possible for all drivers to implement. If not, then the spec should be clarified. Scott-- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Upstream support for multicast IBoE
On Wed, Nov 23, 2011 at 11:17:02PM +0200, Or Gerlitz wrote: Roland Dreier rol...@purestorage.com wrote: looks like the libmlx4 patches I applied from you are missing multicast AH support? Yes, when I reworked the patches for submission I went from easier (RC) to harder (ucast UD + Vlans) and eventually didn't nailed down the multicast front. I probably planned getting there on a later point but it remained behind, anyway, if you're okay to have the ad hoc MGID - ethernet address mapping as you commented on this thread, I can make the patch to add that. Is there any estimate on when we might see something like this upstream? Thanks, Shawn --- This email, along with any attachments, is confidential. If you believe you received this message in error, please contact the sender immediately and delete all copies of the message. Thank you. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] opensm: Get correct guid in case of multiple ports
Hi Alex, Let me start with how we encountered the problem: This problem came up when our customer was using a 2 port card with only one of the port active. opensm could not get the guid of the port that was active in daemon mode. On 13:36 Wed 05 Oct , Goldwyn Rodrigues wrote: In case of multiple ports and running in daemon mode, the active port is not selected because opt.guid is set to INVALID_GUID in main() but the check in get_port_guid is done against zero: if (port_guid == 0) { opt.guid is set to 0 by default. opt.guid is set to INVALID_GUID if a user used -g WRONG_GUID command line option when executing the SM. What happens when -g 0 -B is specified? Check the getopt code. It sets guid to INVALID_GUID. Consider /etc/sysconfig/opensm as well. What happens when you provide -g WRONG_GUID -B? I think in this case, -B should take priority and set with the first active port available. In that case, when SM runs not in daemon mode, SM prompts the user to choose available port GUID out of available range. In case when SM runs in daemon mode, it can't prompt the user so it just exits. On second thoughts, passing port_guid is worthless because this function is called only when no guid is supplied at the command prompt. So, removed the port_guid parameter from the function altogether. If not in daemon mode, it would show the list of ports as intended. Also added error message if no ports are found. Signed-off-by: Goldwyn Rodrigues rgold...@suse.de diff --git a/opensm/main.c b/opensm/main.c index 51c8291..a236859 100644 --- a/opensm/main.c +++ b/opensm/main.c @@ -403,7 +403,7 @@ static void show_usage(void) exit(2); } -static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) +static ib_net64_t get_port_guid(IN osm_opensm_t *p_osm) { ib_port_attr_t attr_array[MAX_LOCAL_IBPORTS]; uint32_t num_ports = MAX_LOCAL_IBPORTS; @@ -436,21 +436,19 @@ static ib_net64_t get_port_guid(IN osm_opensm_t * p_osm, uint64_t port_guid) cl_hton64(attr_array[0].port_guid)); return attr_array[0].port_guid; } - /* If port_guid is 0 - use the first connected port */ - if (port_guid == 0) { + /* If in daemon mode autoselect first available port */ + if (p_osm-subn.opt.daemon) { for (i = 0; i num_ports; i++) if (attr_array[i].link_state IB_LINK_DOWN) break; + /* No port found which is available */ if (i == num_ports) - i = 0; + return 0; printf(Using default GUID 0x% PRIx64 \n, cl_hton64(attr_array[i].port_guid)); return attr_array[i].port_guid; } - if (p_osm-subn.opt.daemon) - return 0; - /* More than one possible port - list all ports and let the user * to choose. */ while (1) { @@ -1106,10 +1104,12 @@ int main(int argc, char *argv[]) then get a port GUID value with which to bind. */ if (opt.guid == 0 || cl_hton64(opt.guid) == CL_HTON64(INVALID_GUID)) - opt.guid = get_port_guid(osm, opt.guid); + opt.guid = get_port_guid(osm); - if (opt.guid == 0) + if (opt.guid == 0) { + printf(\nError: No available ports\n); goto Exit; + } status = osm_opensm_bind(osm, opt.guid); if (status != IB_SUCCESS) { -- Goldwyn -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- -- Alex -- Goldwyn -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Send with immediate data completion
Any thoughts? Personally, I would like to have it in the send completion, but it might not be possible for all drivers to implement. If not, then the spec should be clarified. From my interpretation of the spec, it looks like a bug and the immediate data should be available in the send wc, since it's not explicitly excluded. I'll verify that this is the intent with the IBTA. I haven't traced through the library and driver code to see if mlx4 driver can report it. I'll see if I can do this within the next couple of days. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-3.3 0/6] IB: set of fixes and additions
The series is made of four elements This subset (patches 1-3) fixes two bugs port info related bugs, where the first patch is just making following code related to speed possible... [PATCH 1/6] IB: use central enum for speed instead of hard-coded values [PATCH 2/6] IB/core: fix wrong display of rate in sysfs [PATCH 3/6] IB/mlx4: fix wrong info returned when querying IBoE ports This patch comes to address the long pending patch for 4k mtu along the lines of what was requested over the last review [PATCH 4/6] net/mlx4: allow for dynamic mtu configuration for IB ports This patch is simple and comes to unify the approach for checksum indication with possible / hopefully soon changes in libibverbs [PATCH 5/6] IB: change CQ TCP/IP checksum offload mark to use bit flag indication instead of integer This patch extends the IPoIB child scheme, and will serve as a building block when using IPoIB in the Virtualization space. [PATCH 6/6] IB/ipoib: add support for clones / multiple childs on the same partition Roland, Yep, this submission comes a bit into the 3.3 merge cycle, but with its fairly limited scope, I hope it can make it there... as of the recent changes in mlx4_core, patches 3 4 need you to pull Linus tree, but this should be done now anyway for any future patches to the IB stack through out the 3.3 cycle and elsewhere Or Gerlitz (6): IB: use central enum for speed instead of hard-coded values IB/core: fix wrong display of rate in sysfs IB/mlx4: fix wrong info returned when querying IBoE ports net/mlx4: allow for dynamic mtu configuration for IB ports IB: change CQ TCP/IP checksum offload mark to use bit flag indication instead of integer IB/ipoib: add support for clones / multiple childs on the same partition Documentation/infiniband/ipoib.txt | 24 + drivers/infiniband/core/sysfs.c | 25 -- drivers/infiniband/hw/amso1100/c2_provider.c |2 +- drivers/infiniband/hw/cxgb3/iwch_provider.c |2 +- drivers/infiniband/hw/cxgb4/provider.c |2 +- drivers/infiniband/hw/ehca/ehca_hca.c|2 +- drivers/infiniband/hw/mlx4/cq.c |3 +- drivers/infiniband/hw/mlx4/main.c| 107 +--- drivers/infiniband/hw/mthca/mthca_cq.c |3 +- drivers/infiniband/hw/nes/nes_verbs.c|2 +- drivers/infiniband/hw/qib/qib_rc.c |1 - drivers/infiniband/hw/qib/qib_uc.c |1 - drivers/infiniband/ulp/ipoib/ipoib.h |9 ++- drivers/infiniband/ulp/ipoib/ipoib_ib.c |3 +- drivers/infiniband/ulp/ipoib/ipoib_main.c| 48 --- drivers/infiniband/ulp/ipoib/ipoib_vlan.c| 40 ++--- drivers/net/ethernet/mellanox/mlx4/main.c| 118 +- drivers/net/ethernet/mellanox/mlx4/mlx4.h|2 + drivers/net/ethernet/mellanox/mlx4/port.c| 25 +- include/linux/mlx4/device.h |1 + include/rdma/ib_verbs.h | 11 ++- 21 files changed, 330 insertions(+), 101 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] IB: use central enum for speed instead of hard-coded values
The kernel IB stack uses one enumeration for IB speed, which wasn't explicitly specified in the verbs header file. Add that enum, and use it all over the code. Note that the IB speed/width notation is also used by iWARP and IBoE hw drivers who apply the convention of rate = speed X width, to advertize their port link rate. Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/sysfs.c | 15 +-- drivers/infiniband/hw/amso1100/c2_provider.c |2 +- drivers/infiniband/hw/cxgb3/iwch_provider.c |2 +- drivers/infiniband/hw/cxgb4/provider.c |2 +- drivers/infiniband/hw/ehca/ehca_hca.c|2 +- drivers/infiniband/hw/mlx4/main.c| 10 +- drivers/infiniband/hw/nes/nes_verbs.c|2 +- include/rdma/ib_verbs.h |9 + 8 files changed, 28 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index c61bca3..9ce70ca 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -189,21 +189,24 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, rate = (25 * attr.active_speed) / 10; switch (attr.active_speed) { - case 2: + case IB_SPEED_SDR: + speed = SDR; + break; + case IB_SPEED_DDR: speed = DDR; break; - case 4: + case IB_SPEED_QDR: speed = QDR; break; - case 8: + case IB_SPEED_FDR10: speed = FDR10; rate = 10; break; - case 16: + case IB_SPEED_FDR: speed = FDR; rate = 14; break; - case 32: + case IB_SPEED_EDR: speed = EDR; rate = 25; break; @@ -214,7 +217,7 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, return -EINVAL; return sprintf(buf, %d%s Gb/sec (%dX%s)\n, - rate, (attr.active_speed == 1) ? .5 : , + rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : , ib_width_enum_to_int(attr.active_width), speed); } diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 12f923d..07eb3a8 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -94,7 +94,7 @@ static int c2_query_port(struct ib_device *ibdev, props-pkey_tbl_len = 1; props-qkey_viol_cntr = 0; props-active_width = 1; - props-active_speed = 1; + props-active_speed = IB_SPEED_SDR; return 0; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 37c224f..0bdf09a 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -1227,7 +1227,7 @@ static int iwch_query_port(struct ib_device *ibdev, props-gid_tbl_len = 1; props-pkey_tbl_len = 1; props-active_width = 2; - props-active_speed = 2; + props-active_speed = IB_SPEED_DDR; props-max_msg_sz = -1; return 0; diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c index 247fe70..be1c18f 100644 --- a/drivers/infiniband/hw/cxgb4/provider.c +++ b/drivers/infiniband/hw/cxgb4/provider.c @@ -329,7 +329,7 @@ static int c4iw_query_port(struct ib_device *ibdev, u8 port, props-gid_tbl_len = 1; props-pkey_tbl_len = 1; props-active_width = 2; - props-active_speed = 2; + props-active_speed = IB_SPEED_DDR; props-max_msg_sz = -1; return 0; diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c index 73edc36..9ed4d25 100644 --- a/drivers/infiniband/hw/ehca/ehca_hca.c +++ b/drivers/infiniband/hw/ehca/ehca_hca.c @@ -233,7 +233,7 @@ int ehca_query_port(struct ib_device *ibdev, props-phys_state = 5; props-state = rblock-state; props-active_width= IB_WIDTH_12X; - props-active_speed= 0x1; + props-active_speed= IB_SPEED_SDR; } query_port1: diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 7b445df..6ff6bdf 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -215,16 +215,16 @@ static int ib_link_query_port(struct ib_device *ibdev, u8 port, switch (ext_active_speed) { case 1: - props-active_speed = 16; /* FDR */ + props-active_speed = IB_SPEED_FDR; break; case 2: - props-active_speed = 32; /* EDR */ +
[PATCH 2/6] IB/core: fix wrong display of rate in sysfs
commit 71eeba16 IB: Add new InfiniBand link speeds introduced a bug under which the rate for IB SDR/4X links was displayed as 8.5Gbs instead of 10Gbs, fix that. Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/sysfs.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index 9ce70ca..0b3edc0 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -213,12 +213,18 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, } rate *= ib_width_enum_to_int(attr.active_width); + if (rate 0) return -EINVAL; - return sprintf(buf, %d%s Gb/sec (%dX%s)\n, - rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : , - ib_width_enum_to_int(attr.active_width), speed); + if (attr.active_speed == IB_SPEED_SDR) { + rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10; + return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate, + (attr.active_width == IB_WIDTH_1X) ? .5 : , + ib_width_enum_to_int(attr.active_width), speed); + } else + return sprintf(buf, %d Gb/sec (%dX%s)\n, + rate, ib_width_enum_to_int(attr.active_width), speed); } static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] IB/mlx4: fix wrong info returned when querying IBoE ports
To issue port query, use the QUERY_(Ethernet)_PORT command instead of MAD_IFC command as the latter attempts to query the firmware IB SMA agent, which is irrelevant for IBoE ports. This allows to support both 10Gb/s and 40Gb/s rates (e.g in sysfs), using QDR speed (10Gb/s) and width of 1X or 4X. Signed-off-by: Dotan Barak dot...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/hw/mlx4/main.c | 97 - 1 files changed, 53 insertions(+), 44 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 6ff6bdf..85f6048 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -182,12 +182,27 @@ mlx4_ib_port_link_layer(struct ib_device *device, u8 port_num) } static int ib_link_query_port(struct ib_device *ibdev, u8 port, - struct ib_port_attr *props, - struct ib_smp *in_mad, - struct ib_smp *out_mad) + struct ib_port_attr *props) { + struct ib_smp *in_mad = NULL; + struct ib_smp *out_mad = NULL; int ext_active_speed; - int err; + int err = -ENOMEM; + + in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); + out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); + if (!in_mad || !out_mad) + goto out; + + init_query_mad(in_mad); + in_mad-attr_id = IB_SMP_ATTR_PORT_INFO; + in_mad-attr_mod = cpu_to_be32(port); + + err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, + in_mad, out_mad); + if (err) + goto out; + props-lid = be16_to_cpup((__be16 *) (out_mad-data + 16)); props-lmc = out_mad-data[34] 0x7; @@ -234,15 +249,17 @@ static int ib_link_query_port(struct ib_device *ibdev, u8 port, err = mlx4_MAD_IFC(to_mdev(ibdev), 1, 1, port, NULL, NULL, in_mad, out_mad); if (err) - return err; + goto out; /* Checking LinkSpeedActive for FDR-10 */ if (out_mad-data[15] 0x1) props-active_speed = IB_SPEED_FDR10; } } - - return 0; +out: + kfree(in_mad); + kfree(out_mad); + return err; } static u8 state_to_phys_state(enum ib_port_state state) @@ -251,32 +268,42 @@ static u8 state_to_phys_state(enum ib_port_state state) } static int eth_link_query_port(struct ib_device *ibdev, u8 port, - struct ib_port_attr *props, - struct ib_smp *out_mad) + struct ib_port_attr *props) { - struct mlx4_ib_iboe *iboe = to_mdev(ibdev)-iboe; + + struct mlx4_ib_dev *mdev = to_mdev(ibdev); + struct mlx4_ib_iboe *iboe = mdev-iboe; struct net_device *ndev; enum ib_mtu tmp; + struct mlx4_cmd_mailbox *mailbox; + int err = 0; + + mailbox = mlx4_alloc_cmd_mailbox(mdev-dev); + if (IS_ERR(mailbox)) + return PTR_ERR(mailbox); + + err = mlx4_cmd_box(mdev-dev, 0, mailbox-dma, port, 0, + MLX4_CMD_QUERY_PORT, MLX4_CMD_TIME_CLASS_B, + MLX4_CMD_WRAPPED); + if (err) + goto out; - props-active_width = IB_WIDTH_1X; + props-active_width = (((u8 *)mailbox-buf)[5] == 0x40) ? + IB_WIDTH_4X : IB_WIDTH_1X; props-active_speed = IB_SPEED_QDR; props-port_cap_flags = IB_PORT_CM_SUP; - props-gid_tbl_len = to_mdev(ibdev)-dev-caps.gid_table_len[port]; - props-max_msg_sz = to_mdev(ibdev)-dev-caps.max_msg_sz; + props-gid_tbl_len = mdev-dev-caps.gid_table_len[port]; + props-max_msg_sz = mdev-dev-caps.max_msg_sz; props-pkey_tbl_len = 1; - props-bad_pkey_cntr= be16_to_cpup((__be16 *) (out_mad-data + 46)); - props-qkey_viol_cntr = be16_to_cpup((__be16 *) (out_mad-data + 48)); props-max_mtu = IB_MTU_4096; - props-subnet_timeout = 0; - props-max_vl_num = out_mad-data[37] 4; - props-init_type_reply = 0; + props-max_vl_num = 2; props-state= IB_PORT_DOWN; props-phys_state = state_to_phys_state(props-state); props-active_mtu = IB_MTU_256; spin_lock(iboe-lock); ndev = iboe-netdevs[port - 1]; if (!ndev) - goto out; + goto out_unlock; tmp = iboe_get_mtu(ndev-mtu); props-active_mtu = tmp ? min(props-max_mtu, tmp) : IB_MTU_256; @@ -284,41 +311,23 @@ static int eth_link_query_port(struct ib_device *ibdev, u8 port,
[PATCH 4/6] net/mlx4: allow for dynamic mtu configuration for IB ports
Set the mtu for IB ports by the driver instead of using the firmware 2KB default, with the default being 4KB. Allow for dynamic mtu configuration through a new, per port sysfs entry. As there's a dependency between the port mtu to the maximal number of HW VLs the port can support, apply a mim/max approach, using a loop that goes down from the highest possible number of VLs to the lowest. Use the firmware return status as an indication for the requested number of VLs being impossible with that mtu. For now, and as done with the dynamic link type change / VPI support, the sysfs entry to change the mtu is exposed only when running in NON SRIOV mode. To allow changing the mtu for the master in SRIOV mode, PF initiated FLR (Function Level Reset) has to be implemented. Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- Roland, on v1 - http://marc.info/?l=linux-rdmam=130636143927387w=2 you made a comment saying I mean set the MTU port-by-port with the module loaded, the same way we are supposed to be able to do for the port type. Rather than having one global module parameter, see thread http://marc.info/?l=linux-rdmam=130632548008337w=2, this posting addresses that comment changes from v1 - added ib_port_mtu field to mlx4 device ports and use it on SET_PORT - removed module param - added sysfs entries to set and show the port IB mtu - rebased the patch against Linus tree which has the SRIOV patches Also, some code (enum ibta_mtu and ibta_mtu_to_int) could be eliminated if we allow for the mlx4_core driver to #include rdma/ib_verbs.h basically I don't see a concrete problem arising from such inclusion. drivers/net/ethernet/mellanox/mlx4/main.c | 118 - drivers/net/ethernet/mellanox/mlx4/mlx4.h |2 + drivers/net/ethernet/mellanox/mlx4/port.c | 25 ++- include/linux/mlx4/device.h |1 + 4 files changed, 141 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 6bb62c5..7a58613 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -575,7 +575,6 @@ static ssize_t show_port_type(struct device *dev, return strlen(buf); } - static ssize_t set_port_type(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) @@ -648,6 +647,99 @@ out: return err ? err : count; } +enum ibta_mtu { + IB_MTU_256 = 1, + IB_MTU_512 = 2, + IB_MTU_1024 = 3, + IB_MTU_2048 = 4, + IB_MTU_4096 = 5 +}; + +static inline int int_to_ibta_mtu(int mtu) +{ + switch (mtu) { + case 256: return IB_MTU_256; + case 512: return IB_MTU_512; + case 1024: return IB_MTU_1024; + case 2048: return IB_MTU_2048; + case 4096: return IB_MTU_4096; + default: return -1; + } +} + +static inline int ibta_mtu_to_int(enum ibta_mtu mtu) +{ + switch (mtu) { + case IB_MTU_256: return 256; + case IB_MTU_512: return 512; + case IB_MTU_1024: return 1024; + case IB_MTU_2048: return 2048; + case IB_MTU_4096: return 4096; + default: return -1; + } +} + +static ssize_t show_port_ib_mtu(struct device *dev, +struct device_attribute *attr, +char *buf) +{ + struct mlx4_port_info *info = container_of(attr, struct mlx4_port_info, + port_mtu_attr); + struct mlx4_dev *mdev = info-dev; + + if (mdev-caps.port_type[info-port] == MLX4_PORT_TYPE_ETH) + mlx4_warn(mdev, port level mtu is only used for IB ports\n); + + sprintf(buf, %d\n, + ibta_mtu_to_int(mdev-caps.port_ib_mtu[info-port])); + return strlen(buf); +} + +static ssize_t set_port_ib_mtu(struct device *dev, +struct device_attribute *attr, +const char *buf, size_t count) +{ + struct mlx4_port_info *info = container_of(attr, struct mlx4_port_info, + port_mtu_attr); + struct mlx4_dev *mdev = info-dev; + struct mlx4_priv *priv = mlx4_priv(mdev); + int err, port, mtu, ibta_mtu = -1; + + if (mdev-caps.port_type[info-port] == MLX4_PORT_TYPE_ETH) { + mlx4_warn(mdev, port level mtu is only used for IB ports\n); + return -EINVAL; + } + + err = sscanf(buf, %d, mtu); + if (err 0) + ibta_mtu = int_to_ibta_mtu(mtu); + + if (err = 0 || ibta_mtu 0) { + mlx4_err(mdev, %s is invalid IBTA mtu\n, buf); + return -EINVAL; + } + + mdev-caps.port_ib_mtu[info-port] = ibta_mtu; + + mlx4_stop_sense(mdev); + mutex_lock(priv-port_mutex); + mlx4_unregister_device(mdev); + for (port = 1; port =
[PATCH 5/6] IB: change CQ TCP/IP checksum offload mark to use bit flag indication instead of integer
Use a bit in wc_flags rather then a whole integer, this change for itself doesn't reduce the size of struct ib_wc on 64bit machines, so it stays on 56 bytes as of padding. However, it will allow to add more fields in the future w.o enlarging that structure. Also, it will serve to have unified approach with future libibverbs checksum offload reporting towards applications, where adding a bit flag doesn't break the library ABI. This patch was during conversation with Liran Liss lir...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/hw/mlx4/cq.c |3 ++- drivers/infiniband/hw/mthca/mthca_cq.c |3 ++- drivers/infiniband/hw/qib/qib_rc.c |1 - drivers/infiniband/hw/qib/qib_uc.c |1 - drivers/infiniband/ulp/ipoib/ipoib_ib.c |3 ++- include/rdma/ib_verbs.h |2 +- 6 files changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c index 5ecf38d..09eb1d1 100644 --- a/drivers/infiniband/hw/mlx4/cq.c +++ b/drivers/infiniband/hw/mlx4/cq.c @@ -720,7 +720,8 @@ repoll: wc-dlid_path_bits = (g_mlpath_rqpn 24) 0x7f; wc-wc_flags |= g_mlpath_rqpn 0x8000 ? IB_WC_GRH : 0; wc-pkey_index = be32_to_cpu(cqe-immed_rss_invalid) 0x7f; - wc-csum_ok= mlx4_ib_ipoib_csum_ok(cqe-status, cqe-checksum); + wc-wc_flags |= mlx4_ib_ipoib_csum_ok(cqe-status, + cqe-checksum) ? IB_WC_IP_CSUM_OK : 0; if (rdma_port_get_link_layer(wc-qp-device, (*cur_qp)-port) == IB_LINK_LAYER_ETHERNET) wc-sl = be16_to_cpu(cqe-sl_vid) 13; diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c index 53157b8..40ba833 100644 --- a/drivers/infiniband/hw/mthca/mthca_cq.c +++ b/drivers/infiniband/hw/mthca/mthca_cq.c @@ -643,7 +643,8 @@ static inline int mthca_poll_one(struct mthca_dev *dev, entry-wc_flags |= cqe-g_mlpath 0x80 ? IB_WC_GRH : 0; checksum = (be32_to_cpu(cqe-rqpn) 24) | ((be32_to_cpu(cqe-my_ee) 16) 0xff00); - entry-csum_ok = (cqe-sl_ipok 1 checksum == 0x); + entry-wc_flags |= (cqe-sl_ipok 1 checksum == 0x) ? + IB_WC_IP_CSUM_OK : 0; } entry-status = IB_WC_SUCCESS; diff --git a/drivers/infiniband/hw/qib/qib_rc.c b/drivers/infiniband/hw/qib/qib_rc.c index 894afac..765b4cb 100644 --- a/drivers/infiniband/hw/qib/qib_rc.c +++ b/drivers/infiniband/hw/qib/qib_rc.c @@ -2048,7 +2048,6 @@ send_last: wc.pkey_index = 0; wc.dlid_path_bits = 0; wc.port_num = 0; - wc.csum_ok = 0; /* Signal completion event if the solicited bit is set. */ qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc, (ohdr-bth[0] diff --git a/drivers/infiniband/hw/qib/qib_uc.c b/drivers/infiniband/hw/qib/qib_uc.c index 847e7af..7ce2ac2 100644 --- a/drivers/infiniband/hw/qib/qib_uc.c +++ b/drivers/infiniband/hw/qib/qib_uc.c @@ -422,7 +422,6 @@ last_imm: wc.pkey_index = 0; wc.dlid_path_bits = 0; wc.port_num = 0; - wc.csum_ok = 0; /* Signal completion event if the solicited bit is set. */ qib_cq_enter(to_icq(qp-ibqp.recv_cq), wc, (ohdr-bth[0] diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 4115be5..5c1bc99 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -296,7 +296,8 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) dev-stats.rx_bytes += skb-len; skb-dev = dev; - if ((dev-features NETIF_F_RXCSUM) likely(wc-csum_ok)) + if ((dev-features NETIF_F_RXCSUM) + likely(wc-wc_flags IB_WC_IP_CSUM_OK)) skb-ip_summed = CHECKSUM_UNNECESSARY; napi_gro_receive(priv-napi, skb); diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 01179b0..31996e3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -518,6 +518,7 @@ enum ib_wc_flags { IB_WC_GRH = 1, IB_WC_WITH_IMM = (11), IB_WC_WITH_INVALIDATE = (12), + IB_WC_IP_CSUM_OK= (13), }; struct ib_wc { @@ -538,7 +539,6 @@ struct ib_wc { u8 sl; u8 dlid_path_bits; u8 port_num; /* valid only for DR SMPs on switches */ - int csum_ok; }; enum ib_cq_notify_flags { -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe
[PATCH 6/6] IB/ipoib: add support for clones / multiple childs on the same partition
Allow creating clone child interfaces which further partition an IPoIB interface to sub interfaces which either use the same pkey as their parent or use the same pkey as an already created child interface. Each child now has a child index, which together with the pkey is used as the identifier of the created network device. Child interfaces can still be created/deleted using only the pkey, and they are referred to as legacy childs. Non-legacy, clone childs of IPoIB device named ibN are named ibN.pkey.index and those using the same pkey as the parent ibN device are named ibN.index, where (0x8001 = pkey = 0x) as before this change, and (0 index = 255), example device names are: legacy - ib0, ib1, ib0.8001, ib0.8002 non-legacy (clones) - ib0.1, ib0.2, ib0.8001.1, ib0.8001.2 All sorts of childs are still created/deleted through sysfs, in a similar manner to the way legacy child interfaces are. A major use case for clone childs is for virtualization purposes, e.g under schemes where a per VM NIC / HW queue is desired at the hypervisor level. Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- for the case of non-legacy and same pkey childs I wanted to use a notation of ibN.pkey:index and ibN:index but this is problematic with tools (e.g ifconfig) who treat devices with colon in their names as aliases which are restriced, e.g w.r.t counters, etc, any ideas? Documentation/infiniband/ipoib.txt| 24 ++ drivers/infiniband/ulp/ipoib/ipoib.h |9 - drivers/infiniband/ulp/ipoib/ipoib_main.c | 48 +--- drivers/infiniband/ulp/ipoib/ipoib_vlan.c | 40 +++ 4 files changed, 93 insertions(+), 28 deletions(-) diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt index 64eeb55..2c958f8 100644 --- a/Documentation/infiniband/ipoib.txt +++ b/Documentation/infiniband/ipoib.txt @@ -24,6 +24,30 @@ Partitions and P_Keys The P_Key for any interface is given by the pkey file, and the main interface for a subinterface is in parent. +Clones + + It is possible to further partition an IPoIB interfaces, and create + clone child interfaces which either use the same pkey as their + parent, or as an already created child interface. Each child now has + a child index, which together with the pkey is used as the identifier + of the created network device. + + All sorts of childs are still created/deleted through sysfs, in a + similar manner to the way conventional child interfaces are, for example: + +echo 0x8001.1 /sys/class/net/ib0/create_child + + will create an interface named ib0.8001.1 with P_Key 0x8001 and index 1 + +echo .1 /sys/class/net/ib0/create_child + + will create an interface named ib0.1 with same P_Key as ib0 and index 1 + + To remove a such subinterface, use the delete_child file: + +echo 0x8001.1 /sys/class/net/ib0/create_child +echo .1 /sys/class/net/ib0/create_child + Datagram vs Connected modes The IPoIB driver supports two modes of operation: datagram and diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index b3cc1e0..53e021d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -101,6 +101,8 @@ enum { MAX_SEND_CQE = 16, IPOIB_CM_COPYBREAK= 256, + + MAX_CHILDS_PER_PKEY = 255, }; #defineIPOIB_OP_RECV (1ul 31) @@ -330,6 +332,7 @@ struct ipoib_dev_priv { struct net_device *parent; struct list_head child_intfs; struct list_head list; + int child_index; #ifdef CONFIG_INFINIBAND_IPOIB_CM struct ipoib_cm_dev_priv cm; @@ -488,8 +491,10 @@ void ipoib_transport_dev_cleanup(struct net_device *dev); void ipoib_event(struct ib_event_handler *handler, struct ib_event *record); -int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey); -int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey); +int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey, + unsigned char clone_index); +int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey, + unsigned char clone_index); void ipoib_pkey_poll(struct work_struct *work); int ipoib_pkey_dev_delay_open(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 3514ca0..3a6848d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1110,17 +1110,44 @@ int ipoib_add_umcast_attr(struct net_device *dev) return device_create_file(dev-dev, dev_attr_umcast); } +static int parse_child(struct device *dev, const char *buf, int *pkey, + int *child_index) +{ + int ret; + struct ipoib_dev_priv *priv = netdev_priv(to_net_dev(dev)); + + *pkey =
RE: [PATCH 1/6] IB: use central enum for speed instead of hard-coded values
+enum ib_port_seed { typo on enum name (seed - speed) The rest of the patch looks good to me. - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/6] IB/core: fix wrong display of rate in sysfs
@@ -213,12 +213,18 @@ static ssize_t rate_show(struct ib_port *p, struct port_attribute *unused, } rate *= ib_width_enum_to_int(attr.active_width); + if (rate 0) return -EINVAL; - return sprintf(buf, %d%s Gb/sec (%dX%s)\n, -rate, (attr.active_speed == IB_SPEED_SDR) ? .5 : , -ib_width_enum_to_int(attr.active_width), speed); + if (attr.active_speed == IB_SPEED_SDR) { + rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10; + return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate, +(attr.active_width == IB_WIDTH_1X) ? .5 : , + ib_width_enum_to_int(attr.active_width), speed); + } else + return sprintf(buf, %d Gb/sec (%dX%s)\n, +rate, ib_width_enum_to_int(attr.active_width), speed); } This function ends up like this, with my comments inline: rate = (25 * attr.active_speed) / 10; We set rate here, but... switch (attr.active_speed) { case 2: speed = DDR; break; case 4: speed = QDR; break; case 8: speed = FDR10; rate = 10; break; case 16: speed = FDR; rate = 14; break; case 32: speed = EDR; rate = 25; break; } The above switch statement changes the value for 3 of the 5 cases, with 1 case not covered. rate *= ib_width_enum_to_int(attr.active_width); It is modified again here if (rate 0) return -EINVAL; if (attr.active_speed == IB_SPEED_SDR) { rate = (25 * ib_width_enum_to_int(attr.active_width)) / 10; And we override it in this case, which is the missing switch case. return sprintf(buf, %d%s Gb/sec (%dX%s)\n, rate, (attr.active_width == IB_WIDTH_1X) ? .5 : , ib_width_enum_to_int(attr.active_width), speed); } else return sprintf(buf, %d Gb/sec (%dX%s)\n, rate, ib_width_enum_to_int(attr.active_width), speed); How about just setting rate (possibly as a string value) one time in each switch case? - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/6] IB/core: fix wrong display of rate in sysfs
On Wed, Jan 11, 2012 at 9:30 PM, Hefty, Sean sean.he...@intel.com wrote: This function ends up like this, with my comments inline: rate = (25 * attr.active_speed) / 10; We set rate here, but [...] The above switch statement changes the value for 3 of the 5 cases, with 1 case not covered. rate *= ib_width_enum_to_int(attr.active_width); It is modified again here Yep, this function is surely not programming state of the art.. however, except for SDR, the above line doing rate = rate * ib_width_enum_to_int(attr.active_width) would hold for all the speeds and all the widths (1X, 4X, 12X), if we just set rate (possibly as a string value) one time in each switch case, we would need to repeat that multiplication there... I'm okay with that, would you recommend to go that way? Roland any preference? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Upstream support for multicast IBoE
Shawn Bohrer sboh...@rgmadvisors.com wrote: Is there any estimate on when we might see something like this upstream? Could you elaborate a little on your use case for multicast IBoE traffic? e.g how the setup looks like and how are your Ethernet switches act to route that traffic. Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IB/iSER major problems with Linux 3.0 and Solaris targets
Sebastian Riemer sebastian.rie...@profitbricks.com wrote: [...] we've also tested a 3.0.15 mainline kernel with in-tree IB modules together with the OFED-1.5.4 user-space and this has much better IPoIB performance than the kernel stuff from OFED. So, we want to use them instead, but there is the same problem with the iSER debug messages and iSER doesn't work together with our Solaris 11 COMSTAR targets. On the full part of the glass, its nice to hear the upstream has good IPoIB performance, on the other part, I'll give 3.0.15 a try tomorrow, however, the error you're getting iser_drain_tx_cq:tx id 88402391f898 status 4 vend_err 57 means that iser got local protection error (=4) on the first buffer we used with IB (the connection handshake buffers belong to the IB CM, not to the ULP) which is the login request. Sounds like something is broken maybe dma mapping wise, for this reason I think its likely that the problem might not hit me on my testbed (but I will try, sure). What environment you're running in - is that Dom0 for some hypervisor? if yes, is it KVM which means plain Linux or Xen? it would help if you send me your (compressed, please) .config - send to my @mellanox email. Also, please run with iser debug logs open, set debug_level=2 (lets discuss the extra verbosity later, please) and also the session debug param for the libiscsi module. Is there a chance you can build the latest kernel 3.2 for your environment and run with it once to see if the problem happens there as well? Or. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Send with immediate data completion
From my interpretation of the spec, it looks like a bug and the immediate data should be available in the send wc, since it's not explicitly excluded. I'll verify that this is the intent with the IBTA. I'm still waiting on feedback from the IBTA, but they are looking into the matter. I haven't traced through the library and driver code to see if mlx4 driver can report it. I'll see if I can do this within the next couple of days. The following patch to libmlx4 added the immediate data to the send work completion for me. A similar change looks like it would be needed in the kernel. Report immediate data with send work completions From: Sean Hefty sean.he...@intel.com Signed-off-by: Sean Hefty sean.he...@intel.com --- src/cq.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/src/cq.c b/src/cq.c index 8226b6b..ad1d20c 100644 --- a/src/cq.c +++ b/src/cq.c @@ -46,6 +46,7 @@ #include mlx4.h #include doorbell.h +#include wqe.h enum { MLX4_CQ_DOORBELL= 0x20 @@ -194,6 +195,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq, { struct mlx4_wq *wq; struct mlx4_cqe *cqe; + struct mlx4_wqe_ctrl_seg *ctrl; struct mlx4_srq *srq; uint32_t qpn; uint32_t g_mlpath_rqpn; @@ -240,6 +242,8 @@ static int mlx4_poll_one(struct mlx4_cq *cq, wq = (*cur_qp)-sq; wqe_index = ntohs(cqe-wqe_index); wq-tail += (uint16_t) (wqe_index - (uint16_t) wq-tail); + ctrl = (*cur_qp)-buf.buf + (*cur_qp)-sq.offset + + ((wq-tail (wq-wqe_cnt - 1)) (*cur_qp)-sq.wqe_shift); wc-wr_id = wq-wrid[wq-tail (wq-wqe_cnt - 1)]; ++wq-tail; } else if ((*cur_qp)-ibv_qp.srq) { @@ -265,11 +269,13 @@ static int mlx4_poll_one(struct mlx4_cq *cq, switch (cqe-owner_sr_opcode MLX4_CQE_OPCODE_MASK) { case MLX4_OPCODE_RDMA_WRITE_IMM: wc-wc_flags |= IBV_WC_WITH_IMM; + wc-imm_data = ctrl-imm; case MLX4_OPCODE_RDMA_WRITE: wc-opcode= IBV_WC_RDMA_WRITE; break; case MLX4_OPCODE_SEND_IMM: wc-wc_flags |= IBV_WC_WITH_IMM; + wc-imm_data = ctrl-imm; case MLX4_OPCODE_SEND: wc-opcode= IBV_WC_SEND; break; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Send with immediate data completion
I'm still waiting on feedback from the IBTA, but they are looking into the matter. The intent is for immediate data only to be provided on receive work completions. The IBTA will clarify the spec on this. I'll submit patches that remove setting the wc flag, which may help avoid this confusion some. Thanks, - Sean -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Send with immediate data completion
On Wed, Jan 11, 2012 at 2:22 PM, Hefty, Sean sean.he...@intel.com wrote: The intent is for immediate data only to be provided on receive work completions. The IBTA will clarify the spec on this. I'll submit patches that remove setting the wc flag, which may help avoid this confusion some. The unfortunate thing is that we never defined enum values like IBV_WC_RDMA_WRITE_WITH_IMM to go along with IBV_WR_RDMA_WRITE_WITH_IMM If there was a good reason for that, I've long since forgotten it. But in any case we don't have a good way to return the opcode in the completion queue entry for an RDMA write with immediate request, except the way we do it now. Perhaps the only way forward is with a documentation update? - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Send with immediate data completion
On Wed, Jan 11, 2012 at 03:35:58PM -0800, Roland Dreier wrote: On Wed, Jan 11, 2012 at 2:22 PM, Hefty, Sean sean.he...@intel.com wrote: The intent is for immediate data only to be provided on receive work completions. ?The IBTA will clarify the spec on this. ?I'll submit patches that remove setting the wc flag, which may help avoid this confusion some. The unfortunate thing is that we never defined enum values like IBV_WC_RDMA_WRITE_WITH_IMM to go along with IBV_WR_RDMA_WRITE_WITH_IMM If there was a good reason for that, I've long since forgotten it. Do you think there is a need to have a WC discern if there was an attached immediate data? There is no resource attached to the immediate data that needs special handling. That is about the only argument I can see for continuing to set the IBV_WC_WITH_IMM flag on the WC. All I think is really needed here is a firm note someplace that imm_data is only valid if IBV_WC_RECV is set, while IBV_WC_WITH_IMM is set based on the opcode. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] IB/qib: detour pcie_caps for certain chip sets
Commit 8d4548f2b (IB/qib: Default some module parameters optimally) introduced an issue with older root complexes. They cannot handle the pcie_caps of 0x51. A typical diagnostic in this situation reported by syslog contains the text: [PCIe Poisoned TLP][Send DMA memory read] Insure the 0x51 is not used on the suspect chip sets. Reviewed-by: Mark Debbage mark.debb...@qlogic.com Signed-off-by: Mike Marciniszyn mike.marcinis...@qlogic.com --- drivers/infiniband/hw/qib/qib_pcie.c | 19 ++- 1 files changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/hw/qib/qib_pcie.c b/drivers/infiniband/hw/qib/qib_pcie.c index 0de55c0..bb23ba0 100644 --- a/drivers/infiniband/hw/qib/qib_pcie.c +++ b/drivers/infiniband/hw/qib/qib_pcie.c @@ -577,7 +577,7 @@ static int qib_tune_pcie_coalesce(struct qib_devdata *dd) * BIOS may not set PCIe bus-utilization parameters for best performance. * Check and optionally adjust them to maximize our throughput. */ -static int qib_pcie_caps = 0x51; +static int qib_pcie_caps; module_param_named(pcie_caps, qib_pcie_caps, int, S_IRUGO); MODULE_PARM_DESC(pcie_caps, Max PCIe tuning: Payload (0..3), ReadReq (4..7)); @@ -589,6 +589,7 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd) u16 pcaps, pctl, ecaps, ectl; int rc_sup, ep_sup; int rc_cur, ep_cur; + int caps = 0x51; /* may be overridden below */ /* Find out supported and configured values for parent (root) */ parent = dd-pcidev-bus-self; @@ -609,6 +610,14 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd) pci_read_config_word(dd-pcidev, epos + PCI_EXP_DEVCTL, ectl); } else goto bail; + if (!qib_pcie_caps) { + u16 devid = parent-device; + if ((devid = 0x25e2 devid = 0x25fa) || + (devid = 0x65e2 devid = 0x65fa) || + (devid = 0x4021 devid = 0x402e)) + caps = 0; + } else + caps = qib_pcie_caps; ret = 0; /* Find max payload supported by root, endpoint */ rc_sup = fld2val(pcaps, PCI_EXP_DEVCAP_PAYLOAD); @@ -620,8 +629,8 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd) ep_cur = fld2val(ectl, PCI_EXP_DEVCTL_PAYLOAD); /* If Supported greater than limit in module param, limit it */ - if (rc_sup (qib_pcie_caps 7)) - rc_sup = qib_pcie_caps 7; + if (rc_sup (caps 7)) + rc_sup = caps 7; /* If less than (allowed, supported), bump root payload */ if (rc_sup rc_cur) { rc_cur = rc_sup; @@ -643,8 +652,8 @@ static int qib_tune_pcie_caps(struct qib_devdata *dd) * which is code '5' (log2(4096) - 7) */ rc_sup = 5; - if (rc_sup ((qib_pcie_caps 4) 7)) - rc_sup = (qib_pcie_caps 4) 7; + if (rc_sup ((caps 4) 7)) + rc_sup = (caps 4) 7; rc_cur = fld2val(pctl, PCI_EXP_DEVCTL_READRQ); ep_cur = fld2val(ectl, PCI_EXP_DEVCTL_READRQ); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets
On Wed, Jan 11, 2012 at 10:00:49PM -0500, Mike Marciniszyn wrote: Commit 8d4548f2b (IB/qib: Default some module parameters optimally) introduced an issue with older root complexes. They cannot handle the pcie_caps of 0x51. Should whatever this issue is be a general PCI fixup? Like broken MSI, etc. Might be nice to include what 0x51 tunes in the commit to aide other peoole with the broken chipset :) Isn't it necesary to check the PCI vendor as well as the devid? Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets
On Wed, Jan 11, 2012 at 7:00 PM, Mike Marciniszyn mike.marcinis...@qlogic.com wrote: + if (!qib_pcie_caps) { + u16 devid = parent-device; + if ((devid = 0x25e2 devid = 0x25fa) || + (devid = 0x65e2 devid = 0x65fa) || + (devid = 0x4021 devid = 0x402e)) + caps = 0; + } else Does this work on systems where the broken chipset might not be the immediate parent of the qib device (ie there are some PCIe switches in between)? Pushing knowledge of completely unrelated devices out into every device driver looks it will be impossible to maintain. If you need to know about the whole PCIe fabric to do this type of tuning, the only way I can see to make this work is to add some helper functions to set these values for a given device, and then add all the workaround handling to the core PCI driver. - R. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/qib: detour pcie_caps for certain chip sets
On Wed, Jan 11, 2012 at 9:48 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: Isn't it necesary to check the PCI vendor as well as the devid? That definitely seems like a good idea. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][TRIVIAL] infiniband-diags/configure.in: Fix typo
On Tue, 10 Jan 2012 12:22:58 -0800 Hal Rosenstock h...@dev.mellanox.co.il wrote: Signed-off-by: Hal Rosenstock h...@mellanox.com Thanks applied, Ira --- diff --git a/configure.in b/configure.in index 5a596dc..215c1ea 100644 --- a/configure.in +++ b/configure.in @@ -65,7 +65,7 @@ IBDIAG_CONFIG_PATH_TMP1=`eval echo ${sysconfdir}` IBDIAG_CONFIG_PATH_TMP2=`echo $IBDIAG_CONFIG_PATH_TMP1 | sed 's/^NONE/$ac_default_prefix/'` IBDIAG_CONFIG_PATH=`eval echo $IBDIAG_CONFIG_PATH_TMP2`/infiniband-diags AC_SUBST(IBDIAG_CONFIG_PATH) -AC_DEFINE_UNQUOTED([IBDIAG_CONFIG_PATH], $IBDIAG_CONFIG_PATH, [Define the path to configuratios]) +AC_DEFINE_UNQUOTED([IBDIAG_CONFIG_PATH], $IBDIAG_CONFIG_PATH, [Define the path to configurations]) dnl Check if we should include test utilities AC_MSG_CHECKING(for --enable-test-utils) -- Ira Weiny Math Programmer/Computer Scientist Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ibswitches command displays wrong information when switch description string includes base
While Jason is correct that I am working to deprecate the scripts this bug still exists in the 1.5 branch. Thanks for pointing this out. As Open Fabrics is doing an additional 1.5 release I will attempt to fix before that release. Thanks, Ira On Fri, 6 Jan 2012 16:22:32 -0800 pierre orzechowski pierre.e.orzechow...@oracle.com wrote: Hello, We have a customer reporting an issue with the ibswitches command part of the infiniband-diags package. If the switch description includes the string base anywhere, then the match operation in the awk section of ibwitches does not behave as expected and output is garbled. Example : [root@slcc09sw-ib2 tmp]# grep Switch ibnetdisc.out Switch 36 S-002128468ee1a0a0 # SUN DCS 36P QDR basesw-ib3 10.242.48.23 enhanced port 0 lid 1 lmc 0 Switch 36 S-0021286cc8a3a0a0 # SUN DCS 46P QDR slcc09sw-ib3 10.242.48.24 enhanced port 0 lid 12 lmc 0 [root@slcc09sw-ib2 tmp]# ibswitches ibnetdisc.out Switch : 0x002128468ee1a0a0 ports 36 SUN DCS 36P QDR base port 0 lid 1 lmc 0 Switch : 0x0021286cc8a3a0a0 ports 36 SUN DCS 46P QDR slcc09sw-ib3 10.242.48.24 enhanced port 0 lid 12 lmc 0 The 1 liner workaround we have is to replace in red : echo $text | awk ' /^Switch/ { l=$0 desc=substr(l, match(l, #[ \t]*)+RLENGTH) pi=match(desc, port 0.*) pinfo=substr(desc, pi) desc=substr(desc, 1, pi-2) type=base ti=match(desc, type) ... With echo $text | awk ' /^Switch/ { l=$0 desc=substr(l, match(l, #[ \t]*)+RLENGTH) pi=match(desc, port 0.*) pinfo=substr(desc, pi) desc=substr(desc, 1, pi-2) type=\base\ ti=match(desc, type) ... Thanks for your help in reviewing this potential defect. Apologies if posting to the wrong mailing list. Best regards, Pierre Orzechowski -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Ira Weiny Member of Technical Staff Lawrence Livermore National Lab 925-423-8008 wei...@llnl.gov -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html