[PATCH] RDMA/nes: corrected firmware version update
Now firmware version is read from correct place

Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com
---
 drivers/infiniband/hw/nes/nes_verbs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 0abd4f2..f179586 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -520,7 +520,7 @@ static int nes_query_device(struct ib_device *ibdev, struct ib_device_attr *prop
 	memset(props, 0, sizeof(*props));
 	memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6);
 
-	props->fw_ver = nesdev->nesadapter->fw_ver;
+	props->fw_ver = nesdev->nesadapter->firmware_version;
 	props->device_cap_flags = nesdev->nesadapter->device_cap_flags;
 	props->vendor_id = nesdev->nesadapter->vendor_id;
 	props->vendor_part_id = nesdev->nesadapter->vendor_part_id;
FW: [PATCH] RDMA/nes: corrected link type for nes cards
Now correct interface link type is set for ibv_query_port()

Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com
---
 drivers/infiniband/hw/nes/nes_verbs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index f179586..45bf56c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -599,7 +599,7 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
 	props->active_width = IB_WIDTH_4X;
 	props->active_speed = 1;
 	props->max_msg_sz = 0x8000;
-
+	props->link_layer = IB_LINK_LAYER_ETHERNET;
 	return 0;
 }
RE: RDMA test performance comments (2)
On Jul 13, 2010 06:55 PM, Hefty, Sean sean.he...@intel.com wrote:

>> I see that this rdma_post_send call makes a big contribution to CPU use
>> on the client side. Now the CPU usage (%) is about 95%-99%.
>
> CPU utilization is usually related to how you process completions. If you
> switch from polling the CQ to using events, the CPU utilization will go
> down. This will also result in the latency going up.
>
>> Could you please also send me a comment about the unstable plateau in the
>> graph of speed versus buffer size, around a buffer size of 10^5 bytes?
>> (attached files)
>
> This is likely just an artifact of the test and hardware.

Hi Sean,

thank you for the clarification about CPU utilization. I understand that
the lower the CPU utilization, the higher the latency:

case A) (my current test) poll the CQs -> low latency, about as designed | CPU utilization = 99%
case B) use events (remove polling of the CQs) -> worse latency | low CPU utilization

Is a compromise possible? I think the answer is negative, but I'm not an
expert. This question is open.

For the second point, I have results from 13 speed tests. In the graph I
plot the mean speed for each buffer size, with +/- one standard deviation
as symmetric error bars. It is not a complete statistical analysis, but it
gives an idea. I agree with Sean's opinion about the plateau.

Thanks a lot.
Regards,
Andrea

Andrea Gozzelino
INFN - Laboratori Nazionali di Legnaro (LNL)
Viale dell'Universita' 2 -I-35020 - Legnaro (PD)- ITALIA
Office: E-101
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it
Cell: +39 3488245552

attachment: speed_mean.GIF
attachment: speed_13prove.GIF
attachment: speed_mean_bit.GIF
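One possible middle ground between cases A and B, offered only as a sketch and not as something proposed in this thread, is to busy-poll for a bounded number of attempts and fall back to blocking on an event when nothing shows up. The code below is a self-contained simulation of that idea: struct completion_source, wait_for_completion() and the fake_cq helpers are hypothetical names invented for illustration. With libibverbs, try_poll would typically wrap ibv_poll_cq() and wait_event would wrap ibv_req_notify_cq() followed by ibv_get_cq_event(), but none of those calls are made here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abstraction of a completion source (not a libibverbs type). */
struct completion_source {
	bool (*try_poll)(void *ctx);   /* non-blocking: true if a completion was reaped */
	void (*wait_event)(void *ctx); /* blocking: returns once a completion may be ready */
	void *ctx;
};

/*
 * Busy-poll up to spin_budget times (low latency, burns CPU). If nothing
 * arrives, block once (low CPU, higher latency), poll again, and repeat.
 * Returns how many times the caller had to block.
 */
static unsigned wait_for_completion(const struct completion_source *src,
				    unsigned spin_budget)
{
	unsigned blocked = 0;

	for (;;) {
		for (unsigned i = 0; i < spin_budget; i++)
			if (src->try_poll(src->ctx))
				return blocked;

		src->wait_event(src->ctx);
		blocked++;

		if (src->try_poll(src->ctx))
			return blocked;
	}
}

/* Simulated queue: a completion appears after ready_after polls, or
 * immediately once the caller has blocked on an event. */
struct fake_cq {
	unsigned ready_after;
	unsigned polls;
	unsigned waits;
	bool ready;
};

static bool fake_try_poll(void *ctx)
{
	struct fake_cq *cq = ctx;

	cq->polls++;
	return cq->ready || cq->polls >= cq->ready_after;
}

static void fake_wait_event(void *ctx)
{
	struct fake_cq *cq = ctx;

	cq->waits++;
	cq->ready = true;
}
```

A small spin_budget behaves like case B, a very large one like case A. Where the useful middle lies depends on the message rate and on how much latency the application can tolerate, so it would have to be measured.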
Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?
Pradeep Satyanarayana wrote:
Roland Dreier wrote:

I guess I came to a premature conclusion. One set of tests ran fine and I made that conclusion. Another set of tests caused the following crash:

I don't really know how to interpret this. Is this crash new, or is it the same crash you were hoping this patch fixed?

This is a new crash.

I see other manifestations resulting in different crashes:

:mon> t
[c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
[c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
[c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 [ib_ipoib]
[c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
[c0074603be00] c00c7190 .worker_thread+0xc0/0x180
[c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
[c0074603bf90] c00309fc .kernel_thread+0x54/0x70
9:mon> e
cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
    pc: c05ac390: ._spin_lock+0x20/0xc8
    lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
    sp: c0074603b9a0
   msr: 80009032
   dar: 3a0
 dsisr: 4000
  current = 0xc00756ce8b00
  paca    = 0xc0f63800
    pid   = 18095, comm = ipoib
9:mon>

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[infiniband-diags] [2/4] support --diffcheck in iblinkinfo [REPOST]
Hi Sasha,

Similar to ibnetdiscover, this patch supports a --diffcheck option in iblinkinfo.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

---BeginMessage---
Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |    5 +
 infiniband-diags/src/iblinkinfo.c |   20
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 b/infiniband-diags/man/iblinkinfo.8
index b91afbd..431ab0e 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -58,6 +58,11 @@ output will be displayed showing differences between the old and current
 fabric links. See
 .B ibnetdiscover
 for information on caching ibnetdiscover output.
+.TP
+\fB\-\-diffcheck\fR key(s)
+Specify what diff checks should be done in the \fB\-\-diff\fR option above.
+Comma separate multiple diff check key(s). The available diff checks
+are:\fIport\fR = port connections, \fIstate\fR = port state.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 4b922af..b9c1c32 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -400,6 +400,8 @@ int diff_switch(ibnd_node_t * node, ibnd_fabric_t * orig_fabric,
 static int process_opt(void *context, int ch, char *optarg)
 {
 	struct ibnd_config *cfg = context;
+	char *p;
+
 	switch (ch) {
 	case 1:
 		node_name_map_file = strdup(optarg);
@@ -410,6 +412,22 @@ static int process_opt(void *context, int ch, char *optarg)
 	case 3:
 		diff_cache_file = strdup(optarg);
 		break;
+	case 4:
+		diffcheck_flags = 0;
+		p = strtok(optarg, ",");
+		while (p) {
+			if (!strcasecmp(p, "port"))
+				diffcheck_flags |= DIFF_FLAG_PORT_CONNECTION;
+			else if (!strcasecmp(p, "state"))
+				diffcheck_flags |= DIFF_FLAG_PORT_STATE;
+			else {
+				fprintf(stderr, "invalid diff check key: %s\n",
+					p);
+				return -1;
+			}
+			p = strtok(NULL, ",");
+		}
+		break;
 	case 'S':
 		guid_str = optarg;
 		guid = (uint64_t) strtoull(guid_str, 0, 0);
@@ -480,6 +498,8 @@ int main(int argc, char **argv)
 		 "filename of ibnetdiscover cache to load"},
 		{"diff", 3, 1, "file",
 		 "filename of ibnetdiscover cache to diff"},
+		{"diffcheck", 4, 1, "key(s)",
+		 "specify checks to execute for --diff"},
 		{"outstanding_smps", 'o', 1, NULL,
 		 "specify the number of outstanding SMP's which should be "
 		 "issued during the scan"},
--
1.5.4.5

---End Message---
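For readers who want to try the key parsing from the patch above outside iblinkinfo, here is a minimal standalone sketch. parse_diffcheck_keys() is a hypothetical helper name, not a function in infiniband-diags. The two flag values are the ones that appear as context in patch 3/4 of this series. Unlike the patch, which clears diffcheck_flags before parsing, this sketch leaves the caller's flags untouched when a key is invalid.

```c
#include <stdio.h>
#include <string.h>
#include <strings.h>

#define DIFF_FLAG_PORT_CONNECTION 0x01
#define DIFF_FLAG_PORT_STATE      0x02

/*
 * Hypothetical helper: parse a comma-separated key list such as
 * "port,state" into a flag mask. Keys are matched case-insensitively.
 * Returns 0 on success, -1 on an unknown key (flags_out is then left
 * unchanged). Note that strtok() modifies `list` in place.
 */
static int parse_diffcheck_keys(char *list, unsigned *flags_out)
{
	unsigned flags = 0;
	char *p;

	for (p = strtok(list, ","); p; p = strtok(NULL, ",")) {
		if (!strcasecmp(p, "port"))
			flags |= DIFF_FLAG_PORT_CONNECTION;
		else if (!strcasecmp(p, "state"))
			flags |= DIFF_FLAG_PORT_STATE;
		else {
			fprintf(stderr, "invalid diff check key: %s\n", p);
			return -1;
		}
	}

	*flags_out = flags;
	return 0;
}
```

Because strtok() writes NUL bytes into its argument, the list has to live in writable memory (optarg is fine, a string literal is not).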
[infiniband-diags] [4/4] support --filterdownports in iblinkinfo [REPOST]
Hi Sasha,

This patch supports a new option called --filterdownports. The option will remove downports from the output if they were previously listed as down in a cache.

This option is useful for clusters that have unpopulated switch ports. Many system administrators look for the word "Down" in the iblinkinfo output; however, that is of limited use when many of the ports are down all the time because they are unpopulated. This option attempts to remove that limitation for clusters with unpopulated ports.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

---BeginMessage---
Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |    9
 infiniband-diags/src/iblinkinfo.c |   40 +
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 b/infiniband-diags/man/iblinkinfo.8
index 65ea919..940d008 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -66,6 +66,15 @@ Comma separate multiple diff check key(s). The available diff checks
 are:\fIport\fR = port connections, \fIstate\fR = port state, \fIlid\fR = lids,
 \fInodedesc\fR = node descriptions. If \fIport\fR is specified alongside \fIlid\fR
 or \fInodedesc\fR, remote port lids and node descriptions will also be compared.
+.TP
+\fB\-\-filterdownports\fR filename
+Filter downports indicated in a ibnetdiscover cache. If a port was previously
+indicated as down in the specified cache, and is still down, do not output it in the
+resulting output. This option may be particularly useful for environments
+where switches are not fully populated, thus much of the default iblinkinfo
+info is considered unuseful. See
+.B ibnetdiscover
+for information on caching ibnetdiscover output.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index ae7bbf3..80837ec 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -65,6 +65,8 @@ static nn_map_t *node_name_map = NULL;
 static char *load_cache_file = NULL;
 static char *diff_cache_file = NULL;
 static unsigned diffcheck_flags = DIFF_FLAG_DEFAULT;
+static char *filterdownports_cache_file = NULL;
+static ibnd_fabric_t *filterdownports_fabric = NULL;
 
 static uint64_t guid = 0;
 static char *guid_str = NULL;
@@ -116,6 +118,30 @@ void get_msg(char *width_msg, char *speed_msg, int msg_size, ibnd_port_t * port)
 			      buf, 64, &max_speed));
 }
 
+int filterdownport_check(ibnd_node_t * node, ibnd_port_t * port)
+{
+	ibnd_node_t *fsw;
+	ibnd_port_t *fport;
+	int fistate;
+
+	fsw = ibnd_find_node_guid(filterdownports_fabric, node->guid);
+
+	if (!fsw)
+		return 0;
+
+	if (port->portnum > fsw->numports)
+		return 0;
+
+	fport = fsw->ports[port->portnum];
+
+	if (!fport)
+		return 0;
+
+	fistate = mad_get_field(fport->info, 0, IB_PORT_STATE_F);
+
+	return (fistate == IB_LINK_DOWN) ? 1 : 0;
+}
+
 void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 {
 	char width[64], speed[64], state[64], physstate[64];
@@ -142,6 +168,11 @@ void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 	width_msg[0] = '\0';
 	speed_msg[0] = '\0';
 
+	if (istate == IB_LINK_DOWN
+	    && filterdownports_fabric
+	    && filterdownport_check(node, port))
+		return;
+
 	/* C14-24.2.1 states that a down port allows for invalid data to be
 	 * returned for all PortInfo components except PortState and
 	 * PortPhysicalState */
@@ -467,6 +498,9 @@ static int process_opt(void *context, int ch, char *optarg)
 			p = strtok(NULL, ",");
 		}
 		break;
+	case 5:
+		filterdownports_cache_file = strdup(optarg);
+		break;
 	case 'S':
 		guid_str = optarg;
 		guid = (uint64_t) strtoull(guid_str, 0, 0);
@@ -539,6 +573,8 @@ int main(int argc, char **argv)
 		 "filename of ibnetdiscover cache to diff"},
 		{"diffcheck", 4, 1, "key(s)",
 		 "specify checks to execute for --diff"},
+		{"filterdownports", 5, 1, "file",
+		 "filename of ibnetdiscover cache to filter downports"},
 		{"outstanding_smps", 'o', 1, NULL,
 		 "specify the number of outstanding SMP's which should be "
 		 "issued during the scan"},
@@ -593,6 +629,10 @@ int main(int argc, char **argv)
 	    !(diff_fabric = ibnd_load_fabric(diff_cache_file, 0)))
 		IBERROR("loading cached fabric for diff failed\n");
 
+	if (filterdownports_cache_file &&
+	    !(filterdownports_fabric = ibnd_load_fabric(filterdownports_cache_file, 0)))
+		IBERROR(loading
[infiniband-diags] [3/4] Add lid and node description diff options for --diffcheck in iblinkinfo [REPOST]
Hi Sasha,

This patch supports additional lid and node description diffing options in iblinkinfo. This is similar to the lid and node description --diffcheck options in ibnetdiscover.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

---BeginMessage---
Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |    7 +++-
 infiniband-diags/src/iblinkinfo.c |   53 -
 2 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 b/infiniband-diags/man/iblinkinfo.8
index 431ab0e..65ea919 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -55,14 +55,17 @@ for information on caching ibnetdiscover output.
 Load cached ibnetdiscover data and do a diff comparison to the current
 network or another cache. A special diff output for iblinkinfo
 output will be displayed showing differences between the old and current
-fabric links. See
+fabric links. Be default, the following are compared for differences:
+port connections and port state. See
 .B ibnetdiscover
 for information on caching ibnetdiscover output.
 .TP
 \fB\-\-diffcheck\fR key(s)
 Specify what diff checks should be done in the \fB\-\-diff\fR option above.
 Comma separate multiple diff check key(s). The available diff checks
-are:\fIport\fR = port connections, \fIstate\fR = port state.
+are:\fIport\fR = port connections, \fIstate\fR = port state, \fIlid\fR = lids,
+\fInodedesc\fR = node descriptions. If \fIport\fR is specified alongside \fIlid\fR
+or \fInodedesc\fR, remote port lids and node descriptions will also be compared.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index b9c1c32..ae7bbf3 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -55,6 +55,8 @@
 
 #define DIFF_FLAG_PORT_CONNECTION  0x01
 #define DIFF_FLAG_PORT_STATE       0x02
+#define DIFF_FLAG_LID              0x04
+#define DIFF_FLAG_NODE_DESCRIPTION 0x08
 
 #define DIFF_FLAG_DEFAULT (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE)
 
@@ -224,7 +226,7 @@ void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 
 void print_switch_header(ibnd_node_t *node, int *out_header_flag, char *out_prefix)
 {
-	if (!(*out_header_flag) && !line_mode) {
+	if ((!out_header_flag || !(*out_header_flag)) && !line_mode) {
 		char *remap =
 		    remap_node_name(node_name_map, node->guid, node->nodedesc);
 		printf("%sSwitch 0x%016" PRIx64 " %s:\n",
@@ -308,9 +310,25 @@ void diff_switch_ports(ibnd_node_t * fabric1_node, ibnd_node_t * fabric2_node,
 			output_diff++;
 		}
 
+		if (data->diff_flags & DIFF_FLAG_PORT_CONNECTION
+		    && data->diff_flags & DIFF_FLAG_LID
+		    && fabric1_port && fabric2_port
+		    && fabric1_port->remoteport && fabric2_port->remoteport
+		    && fabric1_port->remoteport->base_lid != fabric2_port->remoteport->base_lid)
+			output_diff++;
+
+		if (data->diff_flags & DIFF_FLAG_PORT_CONNECTION
+		    && data->diff_flags & DIFF_FLAG_NODE_DESCRIPTION
+		    && fabric1_port && fabric2_port
+		    && fabric1_port->remoteport && fabric2_port->remoteport
+		    && memcmp(fabric1_port->remoteport->node->nodedesc,
+			      fabric2_port->remoteport->node->nodedesc,
+			      IB_SMP_DATA_SIZE))
+			output_diff++;
+
 		if (output_diff && fabric1_port) {
 			print_switch_header(fabric1_node,
-					    head_print,
+					    head_print,
 					    NULL);
 			print_port(fabric1_node,
 				   fabric1_port,
@@ -319,7 +337,7 @@ void diff_switch_ports(ibnd_node_t * fabric1_node, ibnd_node_t * fabric2_node,
 
 		if (output_diff && fabric2_port) {
 			print_switch_header(fabric1_node,
-					    head_print,
+					    head_print,
 					    NULL);
 			print_port(fabric2_node,
 				   fabric2_port,
@@ -340,7 +358,22 @@ void diff_switch_iter(ibnd_node_t * fabric1_node, void *iter_user_data)
 	if (!fabric2_node)
 		print_switch(fabric1_node, data->fabric1_prefix);
 	else if (data->diff_flags &
-		 (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE)) {
+		 (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE
+		  | DIFF_FLAG_LID | DIFF_FLAG_NODE_DESCRIPTION)) {
+
+		if ((data->diff_flags & DIFF_FLAG_LID
[PATCH] IB/umad: Remove unused-but-set variable 'already_dead'
Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/infiniband/core/user_mad.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 6babb72..5fa8569 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1085,7 +1085,6 @@ err_cdev:
 static void ib_umad_kill_port(struct ib_umad_port *port)
 {
 	struct ib_umad_file *file;
-	int already_dead;
 	int id;
 
 	dev_set_drvdata(port->dev, NULL);
@@ -1103,7 +1102,6 @@ static void ib_umad_kill_port(struct ib_umad_port *port)
 
 	list_for_each_entry(file, &port->file_list, port_list) {
 		mutex_lock(&file->mutex);
-		already_dead = file->agents_dead;
 		file->agents_dead = 1;
 		mutex_unlock(&file->mutex);
 
--
1.7.1.1

--
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
[PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined semantics
Change code like

	x = expr(++x)

that assigns to x twice without a sequence point in between to the
intended (and well-defined)

	x = expr(x + 1)

Signed-off-by: Roland Dreier rola...@cisco.com
---
I'll queue this for 2.6.36 unless someone objects.

 drivers/infiniband/hw/nes/nes_hw.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 57874a1..f41d890 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -1970,7 +1970,7 @@ void nes_destroy_nic_qp(struct nes_vnic *nesvnic)
 				dev_kfree_skb(
 					nesvnic->nic.tx_skb[nesvnic->nic.sq_tail]);
 
-			nesvnic->nic.sq_tail = (++nesvnic->nic.sq_tail)
+			nesvnic->nic.sq_tail = (nesvnic->nic.sq_tail + 1)
 				& (nesvnic->nic.sq_size - 1);
 		}
 
--
1.7.1.1

--
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
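The well-defined form from the patch above, pulled out into a standalone helper as a minimal sketch. ring_next() and its 16-bit types are assumptions made for illustration, since the nes driver writes the expression inline. The mask only wraps correctly if the queue size is a power of two.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the ring index that follows `idx`,
 * wrapping at `size`. Assumes `size` is a power of two, so that
 * (size - 1) is a valid bit mask. The index is read once and written
 * once by the caller, so there is no unsequenced modification as in
 * "x = (++x) & mask".
 */
static uint16_t ring_next(uint16_t idx, uint16_t size)
{
	return (uint16_t)((idx + 1) & (size - 1));
}
```

A caller would write tail = ring_next(tail, size); which modifies tail exactly once per statement.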
RE: [PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined semantics
> -----Original Message-----
> From: Roland Dreier [mailto:rdre...@cisco.com]
> Sent: Wednesday, July 14, 2010 3:31 PM
> To: Latif, Faisal; Tung, Chien Tin; linux-rdma@vger.kernel.org
> Subject: [PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined semantics
>
> Change code like
>
>	x = expr(++x)
>
> that assigns to x twice without a sequence point in between to the
> intended (and well-defined)
>
>	x = expr(x + 1)
>
> Signed-off-by: Roland Dreier rola...@cisco.com
> ---
> I'll queue this for 2.6.36 unless someone objects.

This is fine.

Thanks
Faisal
Re: IB perf test questions
On Wed, Jul 14, 2010 at 02:28:15PM -0600, Tom Ammon wrote:

> Also, whether I use ib_read_bw or ib_write_bw, the machine I initiate the
> test from (in this case taildrop) shows one of its CPU cores pegged at
> 100% for the duration of the test, but I see no CPU utilization at all on
> the receiving node. Can someone explain to me what's going on under the
> hood, here? I would think that read_bw would load up the sending host but
> that write_bw would load up the receiving host (or maybe vice versa), so
> this seems counterintuitive to me. When I use the -b flag to do a
> bidirectional test, a single CPU core on both machines pegs at 100%.

In all cases the master machine sits in a CPU-bound loop waiting for
completions so it can issue more RDMA operations. The difference between
write and read is simply the RDMA op that is issued.

The slave side just sits there and the NIC does all the work.

Bi-directional mode runs a master operation on both sides.

Jason
Re: NFS over RDMA problem: svcrdma: Error fast registering memory for xprt ffff8803307d7400
I am attempting to use NFS over RDMA (over InfiniBand), but there is some problem. The NFS filesystem can be mounted on the client, and things will work for some time (can read, modify, etc. the files over the mount), but then (at a seemingly random time) the NFS server will dump these lines to the logs:

[ 4380.623922] svcrdma: Error fast registering memory for xprt 8803307d7400
[ 4413.343161] svcrdma: error fast registering xdr for xprt 8803319edc00

Digging into it further, it seems like the Mellanox InfiniBand driver could somehow be involved. After adding some traces to the code, it looks like something like this is happening:

At some time sq_cq_reap() is called, which ends up like this:

sq_cq_reap()
ib_poll_cq()
mlx4_ib_poll_cq()
mlx4_ib_poll_one()
mlx4_ib_handle_error_cqe()

- Which then sets wc->status to IB_WC_WR_FLUSH_ERR rather often, but the killer blow seems to be when IB_WC_REM_ACCESS_ERR is set.
- Because of the error previously, sq_cq_reap sets the XPT_CLOSE flag

Then, sometime later:

fast_reg_read_chunks()
svc_rdma_fastreg()
svc_rdma_send()
svc_rdma_send()

- XPT_CLOSE is set and hence -ENOTCONN is returned
- Since svc_rdma_fastreg() had an error, fast_reg_read_chunks() bails and the client then seems to hang.

I'd ask the InfiniBand guys: what do IB_WC_WR_FLUSH_ERR and IB_WC_REM_ACCESS_ERR mean? Is it something drastic that should result in hangs?

nog.

Both client and server are running the latest vanilla 2.6.34.1 kernel with Mellanox Connect-X InfiniBand cards. If more information is required, please do ask.

BTW: I can reproduce the problem quite reliably by running the bonnie++ benchmark on the NFS mounted filesystem.

nog.

ps: I'm not subscribed to the list, please CC me on all replies.
[PATCH/RESEND] mlx4_core: module param to limit msix vec allocation
The mlx4_core driver allocates 'nreq' msix vectors (and irqs), where:

	nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
		     num_possible_cpus() + 1);

ConnectX HCAs support 512 event queues (4 reserved). On a system with enough processors, we get:

mlx4_core 0006:01:00.0: Requested 508 vectors, but only 256 MSI-X vectors available, trying again

Further attempts (by other drivers) to allocate interrupts fail, because mlx4_core got 'em all.

How about this?

Signed-off-by: Arthur Kepner akep...@sgi.com
---
 main.c |    8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index e3e0d54..0a316d0 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -68,6 +68,10 @@ static int msi_x = 1;
 module_param(msi_x, int, 0444);
 MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
 
+static int max_msi_x_vec = 64;
+module_param(max_msi_x_vec, int, 0444);
+MODULE_PARM_DESC(max_msi_x_vec, "max MSI-X vectors we'll attempt to allocate");
+
 #else /* CONFIG_PCI_MSI */
 
 #define msi_x (0)
@@ -968,8 +972,10 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int i;
 
 	if (msi_x) {
+		nreq = min_t(int, num_possible_cpus() + 1, max_msi_x_vec);
 		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     num_possible_cpus() + 1);
+			     nreq);
+
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
 			goto no_msi;
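To see what the proposed cap does to the numbers quoted above, the two min_t() steps can be reproduced in plain userspace C. This is a sketch for illustration only: msix_vectors_to_request() and min_int() are hypothetical names, with min_int() standing in for the kernel's min_t(int, ...), and the CPU count used in the usage note is an assumed example value.

```c
/* Stand-in for the kernel's min_t(int, a, b). */
static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper mirroring the patched logic: first cap the
 * per-CPU request by max_msi_x_vec, then by the number of event
 * queues that are not reserved.
 */
static int msix_vectors_to_request(int num_eqs, int reserved_eqs,
				   int possible_cpus, int max_msi_x_vec)
{
	int nreq = min_int(possible_cpus + 1, max_msi_x_vec);

	return min_int(num_eqs - reserved_eqs, nreq);
}
```

With 512 event queues, 4 reserved, and an assumed 1024 possible CPUs, the unpatched expression gives 508, while the patched logic with the default max_msi_x_vec of 64 gives 64. On a machine with 8 possible CPUs both give 9, so small systems are unaffected.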
RE: IB perf test questions
Regarding the BW numbers: they look reasonable. You can try to improve the BW up to 3.2GB/s by increasing PCIe MTU from 128 to 256 bytes. It usually requires BIOS configuration changes.

Boris Shpolyansky
Sr. Member of Technical Staff, Applications
Mellanox Technologies Inc.
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085
Tel.: (408) 916 0014
Fax: (408) 585 0314
Cell: (408) 834 9365
www.mellanox.com

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Wednesday, July 14, 2010 2:17 PM
To: Tom Ammon
Cc: linux-rdma@vger.kernel.org
Subject: Re: IB perf test questions

On Wed, Jul 14, 2010 at 02:28:15PM -0600, Tom Ammon wrote:

> Also, whether I use ib_read_bw or ib_write_bw, the machine I initiate the
> test from (in this case taildrop) shows one of its CPU cores pegged at
> 100% for the duration of the test, but I see no CPU utilization at all on
> the receiving node. Can someone explain to me what's going on under the
> hood, here? I would think that read_bw would load up the sending host but
> that write_bw would load up the receiving host (or maybe vice versa), so
> this seems counterintuitive to me. When I use the -b flag to do a
> bidirectional test, a single CPU core on both machines pegs at 100%.

In all cases the master machine sits in a CPU-bound loop waiting for
completions so it can issue more RDMA operations. The difference between
write and read is simply the RDMA op that is issued.

The slave side just sits there and the NIC does all the work.

Bi-directional mode runs a master operation on both sides.

Jason