[PATCH] RDMA/nes: corrected firmware version update

2010-07-14 Thread Walukiewicz, Miroslaw
The firmware version is now read from the correct place.

Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 0abd4f2..f179586 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -520,7 +520,7 @@ static int nes_query_device(struct ib_device *ibdev, struct ib_device_attr *prop
 	memset(props, 0, sizeof(*props));
 	memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6);
 
-	props->fw_ver = nesdev->nesadapter->fw_ver;
+	props->fw_ver = nesdev->nesadapter->firmware_version;
 	props->device_cap_flags = nesdev->nesadapter->device_cap_flags;
 	props->vendor_id = nesdev->nesadapter->vendor_id;
 	props->vendor_part_id = nesdev->nesadapter->vendor_part_id;




FW: [PATCH] RDMA/nes: corrected link type for nes cards

2010-07-14 Thread Walukiewicz, Miroslaw
The correct interface link type is now set for ibv_query_port().

Signed-off-by: Mirek Walukiewicz miroslaw.walukiew...@intel.com
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index f179586..45bf56c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -599,7 +599,7 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
 	props->active_width = IB_WIDTH_4X;
 	props->active_speed = 1;
 	props->max_msg_sz = 0x8000;
-
+	props->link_layer = IB_LINK_LAYER_ETHERNET;
 	return 0;
 }
 



RE: RDMA test performance comments (2)

2010-07-14 Thread Andrea Gozzelino
On Jul 13, 2010 06:55 PM, Hefty, Sean sean.he...@intel.com wrote:

> > I see that this rdma_post_send call gives a big contribute to CPU
> > use on
> > client side. Now the CPU usage (%) is about 95%-99%.
>
> CPU utilization is usually related to how you process completions. If
> you switch from polling the CQ to using events, the CPU utilization
> will go down. This will also result in the latency going up.
>
> > Could you please send me a comment also about the plateau non
> > stable
> > in graph speed versus buffer size with buffer size  10^5 bytes?
> > (attached files)
>
> This is likely just an artifact of the test and hardware.

Hi Sean,

thank you for the clarification about CPU utilization.
I understand that the lower the CPU utilization, the higher the latency.

case A) (my current test) poll the CQs --> low latency, about as designed | CPU
utilization = 99%

case B) use events (no CQ polling) --> worse latency | low CPU
utilization

Is a compromise possible? I think the answer is no, but I'm not
an expert.
This question is open.


For the second point, I have results from 13 speed tests. In the
graph I plotted the mean speed for each buffer size, with +/- the standard
deviation as symmetric error bars. It is not a complete statistical analysis,
but it gives an idea. I agree with Sean's opinion about the plateau.

Thanks a lot.
Regards,
Andrea


Andrea Gozzelino

INFN - Laboratori Nazionali di Legnaro  (LNL)
Viale dell'Universita' 2 -I-35020 - Legnaro (PD)- ITALIA
Office: E-101
Tel: +39 049 8068346
Fax: +39 049 641925
Mail: andrea.gozzel...@lnl.infn.it
Cell: +39 3488245552
attachment: speed_mean.GIF
attachment: speed_13prove.GIF
attachment: speed_mean_bit.GIF
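
On the question of a compromise between case A and case B: one commonly
described middle ground is to poll for a bounded number of iterations and
only then fall back to waiting for an event. The sketch below shows just that
control flow and is not taken from this thread. The cq_ops structure, the
fake_cq simulation and all function names are hypothetical; with libibverbs
the poll step would correspond to ibv_poll_cq(), and the wait step to arming
the CQ with ibv_req_notify_cq() and blocking in ibv_get_cq_event(). Whether
this helps in a given test depends on the workload and the chosen spin limit.

```c
#include <stddef.h>

/* Hypothetical abstraction, not a libibverbs type: poll() returns the
 * number of completions reaped (0 if none), wait() blocks until the
 * completion queue signals an event. */
struct cq_ops {
	int (*poll)(void *ctx);
	void (*wait)(void *ctx);
};

/* Spin for at most spin_limit empty polls, then block once and retry.
 * Returns the number of completions reaped (always > 0). */
static int reap_hybrid(const struct cq_ops *ops, void *ctx, int spin_limit)
{
	int n, spins;

	for (;;) {
		for (spins = 0; spins < spin_limit; spins++) {
			n = ops->poll(ctx);
			if (n > 0)
				return n;	/* low-latency path */
		}
		/* Nothing arrived while spinning: give up the CPU. */
		ops->wait(ctx);
	}
}

/* Simulated queue for illustration only: a completion becomes visible
 * on poll number ready_at, or on the first poll after a wait. */
struct fake_cq {
	int polls;
	int waits;
	int ready_at;
};

static int fake_poll(void *ctx)
{
	struct fake_cq *cq = ctx;

	cq->polls++;
	return cq->polls >= cq->ready_at ? 1 : 0;
}

static void fake_wait(void *ctx)
{
	struct fake_cq *cq = ctx;

	cq->waits++;
	cq->ready_at = cq->polls + 1;	/* the event means data is ready */
}
```

With a small spin limit the behaviour approaches case B, with a very large
one it approaches case A.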

Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-14 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Roland Dreier wrote:
   I guess I came to a premature conclusion. One set of tests ran fine and I 
 made that
   conclusion. Another set of tests caused the following crash:

 I don't really know how to interpret this.  Is this crash new, or is it
 the same crash you were hoping this patch fixed?
 
 This is a new crash.

I see other manifestations resulting in different crashes :

:mon t
[c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
[c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
[c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
[ib_ipoib]
[c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
[c0074603be00] c00c7190 .worker_thread+0xc0/0x180
[c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
[c0074603bf90] c00309fc .kernel_thread+0x54/0x70
9:mon e
cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
pc: c05ac390: ._spin_lock+0x20/0xc8
lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
sp: c0074603b9a0
   msr: 80009032
   dar: 3a0
 dsisr: 4000
  current = 0xc00756ce8b00
  paca= 0xc0f63800
pid   = 18095, comm = ipoib
9:mon

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[infiniband-diags] [2/4] support --diffcheck in iblinkinfo [REPOST]

2010-07-14 Thread Albert Chu
Hi Sasha,

Similar to ibnetdiscover, this patch supports a --diffcheck option in
iblinkinfo.

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
---BeginMessage---

Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |5 +
 infiniband-diags/src/iblinkinfo.c |   20 
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 
b/infiniband-diags/man/iblinkinfo.8
index b91afbd..431ab0e 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -58,6 +58,11 @@ output will be displayed showing differences between the old 
and current
 fabric links.  See
 .B ibnetdiscover
 for information on caching ibnetdiscover output.
+.TP
+\fB\-\-diffcheck\fR key(s)
+Specify what diff checks should be done in the \fB\-\-diff\fR option above.
+Comma separate multiple diff check key(s).  The available diff checks
+are:\fIport\fR = port connections, \fIstate\fR = port state.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c b/infiniband-diags/src/iblinkinfo.c
index 4b922af..b9c1c32 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -400,6 +400,8 @@ int diff_switch(ibnd_node_t * node, ibnd_fabric_t * orig_fabric,
 static int process_opt(void *context, int ch, char *optarg)
 {
 	struct ibnd_config *cfg = context;
+	char *p;
+
 	switch (ch) {
 	case 1:
 		node_name_map_file = strdup(optarg);
@@ -410,6 +412,22 @@ static int process_opt(void *context, int ch, char *optarg)
 	case 3:
 		diff_cache_file = strdup(optarg);
 		break;
+	case 4:
+		diffcheck_flags = 0;
+		p = strtok(optarg, ",");
+		while (p) {
+			if (!strcasecmp(p, "port"))
+				diffcheck_flags |= DIFF_FLAG_PORT_CONNECTION;
+			else if (!strcasecmp(p, "state"))
+				diffcheck_flags |= DIFF_FLAG_PORT_STATE;
+			else {
+				fprintf(stderr, "invalid diff check key: %s\n",
+					p);
+				return -1;
+			}
+			p = strtok(NULL, ",");
+		}
+		break;
 	case 'S':
 		guid_str = optarg;
 		guid = (uint64_t) strtoull(guid_str, 0, 0);
@@ -480,6 +498,8 @@ int main(int argc, char **argv)
 		 "filename of ibnetdiscover cache to load"},
 		{"diff", 3, 1, "file",
 		 "filename of ibnetdiscover cache to diff"},
+		{"diffcheck", 4, 1, "key(s)",
+		 "specify checks to execute for --diff"},
 		{"outstanding_smps", 'o', 1, NULL,
 		 "specify the number of outstanding SMP's which should be "
 		 "issued during the scan"},
-- 
1.5.4.5

---End Message---
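
The key parsing added in the patch above can be tried outside of iblinkinfo.
What follows is a minimal standalone sketch, assuming only the two keys this
patch introduces and the flag values shown in the iblinkinfo.c context of the
follow-up patch (0x01 and 0x02). The function name parse_diffcheck() is
hypothetical and is not part of infiniband-diags.

```c
#include <string.h>
#include <strings.h>

#define DIFF_FLAG_PORT_CONNECTION 0x01
#define DIFF_FLAG_PORT_STATE      0x02

/* Parse a comma-separated key list such as "port,state".
 * The argument is modified in place because strtok() writes NUL bytes
 * over the separators. Returns 0 and stores the flags on success,
 * returns -1 and leaves *flags untouched on an unknown key. */
static int parse_diffcheck(char *arg, unsigned int *flags)
{
	unsigned int f = 0;
	char *p;

	for (p = strtok(arg, ","); p; p = strtok(NULL, ",")) {
		if (!strcasecmp(p, "port"))
			f |= DIFF_FLAG_PORT_CONNECTION;
		else if (!strcasecmp(p, "state"))
			f |= DIFF_FLAG_PORT_STATE;
		else
			return -1;
	}
	*flags = f;
	return 0;
}
```

Unlike the patch, the sketch builds the flags in a local variable, so a bad
key does not leave a half-updated flag set behind. That is a design choice
for the example, not a claim about the tool.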


[infiniband-diags] [4/4] support --filterdownports in iblinkinfo [REPOST]

2010-07-14 Thread Albert Chu
Hi Sasha,

This patch supports a new option called --filterdownports.  The option
will remove downports from the output if they were previously listed as
down in a cache.

This option is useful for clusters that have unpopulated switch ports.
Many system administrators look for the word "Down" in the iblinkinfo
output, however, that ability is limited when so many of the ports are
down all the time b/c of unpopulated ports.  This option attempts to
remove that limitation for clusters with unpopulated ports.

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
---BeginMessage---

Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |9 
 infiniband-diags/src/iblinkinfo.c |   40 +
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 
b/infiniband-diags/man/iblinkinfo.8
index 65ea919..940d008 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -66,6 +66,15 @@ Comma separate multiple diff check key(s).  The available 
diff checks
 are:\fIport\fR = port connections, \fIstate\fR = port state, \fIlid\fR = lids,
 \fInodedesc\fR = node descriptions.  If \fIport\fR is specified alongside 
\fIlid\fR
 or \fInodedesc\fR, remote port lids and node descriptions will also be 
compared.
+.TP
+\fB\-\-filterdownports\fR filename
+Filter downports indicated in a ibnetdiscover cache.  If a port was previously
+indicated as down in the specified cache, and is still down, do not output it 
in the
+resulting output.  This option may be particularly useful for environments
+where switches are not fully populated, thus much of the default iblinkinfo
+info is considered unuseful.  See
+.B ibnetdiscover
+for information on caching ibnetdiscover output.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c 
b/infiniband-diags/src/iblinkinfo.c
index ae7bbf3..80837ec 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -65,6 +65,8 @@ static nn_map_t *node_name_map = NULL;
 static char *load_cache_file = NULL;
 static char *diff_cache_file = NULL;
 static unsigned diffcheck_flags = DIFF_FLAG_DEFAULT;
+static char *filterdownports_cache_file = NULL;
+static ibnd_fabric_t *filterdownports_fabric = NULL;
 
 static uint64_t guid = 0;
 static char *guid_str = NULL;
@@ -116,6 +118,30 @@ void get_msg(char *width_msg, char *speed_msg, int 
msg_size, ibnd_port_t * port)
  buf, 64, max_speed));
 }
 
+int filterdownport_check(ibnd_node_t * node, ibnd_port_t * port)
+{
+	ibnd_node_t *fsw;
+	ibnd_port_t *fport;
+	int fistate;
+
+	fsw = ibnd_find_node_guid(filterdownports_fabric, node->guid);
+
+	if (!fsw)
+		return 0;
+
+	if (port->portnum > fsw->numports)
+		return 0;
+
+	fport = fsw->ports[port->portnum];
+
+	if (!fport)
+		return 0;
+
+	fistate = mad_get_field(fport->info, 0, IB_PORT_STATE_F);
+
+	return (fistate == IB_LINK_DOWN) ? 1 : 0;
+}
+
 void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 {
char width[64], speed[64], state[64], physstate[64];
@@ -142,6 +168,11 @@ void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 		width_msg[0] = '\0';
 		speed_msg[0] = '\0';
 
+	if (istate == IB_LINK_DOWN
+	    && filterdownports_fabric
+	    && filterdownport_check(node, port))
+		return;
+
/* C14-24.2.1 states that a down port allows for invalid data to be
 * returned for all PortInfo components except PortState and
 * PortPhysicalState */
@@ -467,6 +498,9 @@ static int process_opt(void *context, int ch, char *optarg)
p = strtok(NULL, ,);
}
break;
+   case 5:
+   filterdownports_cache_file = strdup(optarg);
+   break;
case 'S':
guid_str = optarg;
guid = (uint64_t) strtoull(guid_str, 0, 0);
@@ -539,6 +573,8 @@ int main(int argc, char **argv)
 filename of ibnetdiscover cache to diff},
{diffcheck, 4, 1, key(s),
 specify checks to execute for --diff},
+   {filterdownports, 5, 1, file,
+filename of ibnetdiscover cache to filter downports},
{outstanding_smps, 'o', 1, NULL,
 specify the number of outstanding SMP's which should be 
 issued during the scan},
@@ -593,6 +629,10 @@ int main(int argc, char **argv)
 	    !(diff_fabric = ibnd_load_fabric(diff_cache_file, 0)))
 		IBERROR("loading cached fabric for diff failed\n");
 
+	if (filterdownports_cache_file &&
+	    !(filterdownports_fabric = ibnd_load_fabric(filterdownports_cache_file, 0)))
+   IBERROR(loading 
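
The filtering rule described in the cover note (hide a port only if it is
down now and was already down in the cache) can be written as a small
predicate. This is a sketch of the rule as stated, not code from the patch.
The enum, its numeric values and the name hide_port() are illustrative
assumptions.

```c
/* Illustrative state values for this sketch only. */
enum link_state {
	LINK_NOT_IN_CACHE = -1,	/* port or switch absent from the cache */
	LINK_DOWN = 1,
	LINK_ACTIVE = 4
};

/* Return 1 if the port should be left out of the output. */
static int hide_port(int current_state, int cached_state)
{
	return current_state == LINK_DOWN && cached_state == LINK_DOWN;
}
```

A port that was up in the cache and is down now is still printed, which is
the case an administrator searching for "Down" wants to see.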

[infiniband-diags] [3/4] Add lid and node description diff options for --diffcheck in iblinkinfo [REPOST]

2010-07-14 Thread Albert Chu
Hi Sasha,

This patch supports additional lid and node description diffing options
in iblinkinfo.  This is similar to the lid and node description --diffcheck
options in ibnetdiscover.

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
---BeginMessage---

Signed-off-by: Albert Chu ch...@llnl.gov
---
 infiniband-diags/man/iblinkinfo.8 |7 +++-
 infiniband-diags/src/iblinkinfo.c |   53 -
 2 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/infiniband-diags/man/iblinkinfo.8 
b/infiniband-diags/man/iblinkinfo.8
index 431ab0e..65ea919 100644
--- a/infiniband-diags/man/iblinkinfo.8
+++ b/infiniband-diags/man/iblinkinfo.8
@@ -55,14 +55,17 @@ for information on caching ibnetdiscover output.
 Load cached ibnetdiscover data and do a diff comparison to the current
 network or another cache.  A special diff output for iblinkinfo
 output will be displayed showing differences between the old and current
-fabric links.  See
+fabric links.  Be default, the following are compared for differences:
+port connections and port state.  See
 .B ibnetdiscover
 for information on caching ibnetdiscover output.
 .TP
 \fB\-\-diffcheck\fR key(s)
 Specify what diff checks should be done in the \fB\-\-diff\fR option above.
 Comma separate multiple diff check key(s).  The available diff checks
-are:\fIport\fR = port connections, \fIstate\fR = port state.
+are:\fIport\fR = port connections, \fIstate\fR = port state, \fIlid\fR = lids,
+\fInodedesc\fR = node descriptions.  If \fIport\fR is specified alongside 
\fIlid\fR
+or \fInodedesc\fR, remote port lids and node descriptions will also be 
compared.
 
 .SH AUTHOR
 .TP
diff --git a/infiniband-diags/src/iblinkinfo.c 
b/infiniband-diags/src/iblinkinfo.c
index b9c1c32..ae7bbf3 100644
--- a/infiniband-diags/src/iblinkinfo.c
+++ b/infiniband-diags/src/iblinkinfo.c
@@ -55,6 +55,8 @@
 
 #define DIFF_FLAG_PORT_CONNECTION  0x01
 #define DIFF_FLAG_PORT_STATE   0x02
+#define DIFF_FLAG_LID  0x04
+#define DIFF_FLAG_NODE_DESCRIPTION 0x08
 
 #define DIFF_FLAG_DEFAULT (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE)
 
@@ -224,7 +226,7 @@ void print_port(ibnd_node_t * node, ibnd_port_t * port, char *out_prefix)
 
 void print_switch_header(ibnd_node_t *node, int *out_header_flag, char *out_prefix)
 {
-	if (!(*out_header_flag) && !line_mode) {
+	if ((!out_header_flag || !(*out_header_flag)) && !line_mode) {
 		char *remap =
 			remap_node_name(node_name_map, node->guid, node->nodedesc);
 		printf("%sSwitch 0x%016" PRIx64 " %s:\n",
@@ -308,9 +310,25 @@ void diff_switch_ports(ibnd_node_t * fabric1_node, ibnd_node_t * fabric2_node,
 			output_diff++;
 		}
 
+		if (data->diff_flags & DIFF_FLAG_PORT_CONNECTION
+		    && data->diff_flags & DIFF_FLAG_LID
+		    && fabric1_port && fabric2_port
+		    && fabric1_port->remoteport && fabric2_port->remoteport
+		    && fabric1_port->remoteport->base_lid != fabric2_port->remoteport->base_lid)
+			output_diff++;
+
+		if (data->diff_flags & DIFF_FLAG_PORT_CONNECTION
+		    && data->diff_flags & DIFF_FLAG_NODE_DESCRIPTION
+		    && fabric1_port && fabric2_port
+		    && fabric1_port->remoteport && fabric2_port->remoteport
+		    && memcmp(fabric1_port->remoteport->node->nodedesc,
+			      fabric2_port->remoteport->node->nodedesc,
+			      IB_SMP_DATA_SIZE))
+			output_diff++;
+
 		if (output_diff && fabric1_port) {
 			print_switch_header(fabric1_node,
-   head_print,
+   head_print,
NULL);
print_port(fabric1_node,
   fabric1_port,
@@ -319,7 +337,7 @@ void diff_switch_ports(ibnd_node_t * fabric1_node, ibnd_node_t * fabric2_node,
 
 		if (output_diff && fabric2_port) {
print_switch_header(fabric1_node,
-   head_print,
+   head_print,
NULL);
print_port(fabric2_node,
   fabric2_port,
@@ -340,7 +358,22 @@ void diff_switch_iter(ibnd_node_t * fabric1_node, void *iter_user_data)
 	if (!fabric2_node)
 		print_switch(fabric1_node, data->fabric1_prefix);
 	else if (data->diff_flags &
-		 (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE)) {
+		 (DIFF_FLAG_PORT_CONNECTION | DIFF_FLAG_PORT_STATE
+		  | DIFF_FLAG_LID | DIFF_FLAG_NODE_DESCRIPTION)) {
+
+		if ((data->diff_flags & DIFF_FLAG_LID
+ 

[PATCH] IB/umad: Remove unused-but-set variable 'already_dead'

2010-07-14 Thread Roland Dreier
Signed-off-by: Roland Dreier rola...@cisco.com
---
 drivers/infiniband/core/user_mad.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 6babb72..5fa8569 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1085,7 +1085,6 @@ err_cdev:
 static void ib_umad_kill_port(struct ib_umad_port *port)
 {
 	struct ib_umad_file *file;
-	int already_dead;
 	int id;
 
 	dev_set_drvdata(port->dev, NULL);
@@ -1103,7 +1102,6 @@ static void ib_umad_kill_port(struct ib_umad_port *port)
 
 	list_for_each_entry(file, &port->file_list, port_list) {
 		mutex_lock(&file->mutex);
-		already_dead = file->agents_dead;
 		file->agents_dead = 1;
 		mutex_unlock(&file->mutex);
 
-- 
1.7.1.1


-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


[PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined semantics

2010-07-14 Thread Roland Dreier
Change code like

x = expr(++x)

that assigns to x twice without a sequence point in between to the
intended (and well-defined)

x = expr(x + 1)

Signed-off-by: Roland Dreier rola...@cisco.com
---
I'll queue this for 2.6.36 unless someone objects.

 drivers/infiniband/hw/nes/nes_hw.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index 57874a1..f41d890 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -1970,7 +1970,7 @@ void nes_destroy_nic_qp(struct nes_vnic *nesvnic)
 			dev_kfree_skb(
 				nesvnic->nic.tx_skb[nesvnic->nic.sq_tail]);
 
-		nesvnic->nic.sq_tail = (++nesvnic->nic.sq_tail)
+		nesvnic->nic.sq_tail = (nesvnic->nic.sq_tail + 1)
 					& (nesvnic->nic.sq_size - 1);
}
 
-- 
1.7.1.1


-- 
Roland Dreier rola...@cisco.com || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
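
The well-defined form of the index update can be isolated in a small helper.
This is a sketch under the assumption that the ring size is a power of two,
which the mask with (size - 1) in the driver code suggests. The name
ring_next() is hypothetical and does not exist in the nes driver.

```c
/* Advance an index in a ring whose size is a power of two.
 * The result is computed from a plain read of idx, so the caller's
 * assignment is the only modification of the variable:
 *
 *	tail = ring_next(tail, size);
 *
 * In contrast, tail = (++tail) & (size - 1) modifies tail twice with
 * no sequence point in between, which C leaves undefined. */
static unsigned int ring_next(unsigned int idx, unsigned int size)
{
	return (idx + 1) & (size - 1);
}
```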


RE: [PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined semantics

2010-07-14 Thread Latif, Faisal

> -Original Message-
> From: Roland Dreier [mailto:rdre...@cisco.com]
> Sent: Wednesday, July 14, 2010 3:31 PM
> To: Latif, Faisal; Tung, Chien Tin; linux-rdma@vger.kernel.org
> Subject: [PATCH/RFC] RDMA/nes: Rewrite expression to avoid undefined
> semantics
>
> Change code like
>
> 	x = expr(++x)
>
> that assigns to x twice without a sequence point in between to the
> intended (and well-defined)
>
> 	x = expr(x + 1)
>
> Signed-off-by: Roland Dreier rola...@cisco.com
> ---
> I'll queue this for 2.6.36 unless someone objects.


This is fine.

Thanks
Faisal


Re: IB perf test questions

2010-07-14 Thread Jason Gunthorpe
On Wed, Jul 14, 2010 at 02:28:15PM -0600, Tom Ammon wrote:

> Also, whether I use ib_read_bw or ib_write_bw, the machine I initiate
> the test from (in this case taildrop) shows one of its CPU cores
> pegged at 100% for the duration of the test, but I see no CPU
> utilization at all on the receiving node. Can someone explain to me
> what's going on under the hood, here? I would think that read_bw would
> load up the sending host but that write_bw would load up the receiving
> host (or maybe vice versa), so this seems counterintuitive to me. when I
> use the -b flag to do a bidirectional test, a single CPU core on both
> machines pegs at 100%.

In all cases the master machine sits in a CPU bound loop waiting for
completions so it can issue more RDMA operations. The difference
between write and read is simply the RDMA op that is issued.

The slave side just sits there and the NIC does all the work.

Bi-directional mode runs a master operation on both sides.

Jason


Re: NFS over RDMA problem: svcrdma: Error fast registering memory for xprt ffff8803307d7400

2010-07-14 Thread Gyorgy Jeney
> I am attempting to use NFS over RDMA (over infiniband), but there is some
> problem.  The NFS filesystem can be mounted on the client, and things
> will work for some time (can read, modify, etc. the files over the mount),
> but then (at a seemingly random time) the NFS server will dump these
> lines to the logs:
>
> [ 4380.623922] svcrdma: Error fast registering memory for xprt 8803307d7400
> [ 4413.343161] svcrdma: error fast registering xdr for xprt 8803319edc00

Digging into it further, it seems like the Mellanox InfiniBand driver
could somehow be involved.  After adding some traces to the code, it's
obvious something like this is happening:

At some time sq_cq_reap() is called, which ends up like this:

  sq_cq_reap()
    ib_poll_cq()
      mlx4_ib_poll_cq()
        mlx4_ib_poll_one()
          mlx4_ib_handle_error_cqe()
            - Which then sets wc->status to IB_WC_WR_FLUSH_ERR rather
              often, but the killer blow seems to be when
              IB_WC_REM_ACCESS_ERR is set.
    - Because of the error previously, sq_cq_reap sets the XPT_CLOSE
      flag

Then, sometime later:

  fast_reg_read_chunks()
    svc_rdma_fastreg()
      svc_rdma_send()
        svc_rdma_send()
          - XPT_CLOSE is set and hence -ENOTCONN is returned
    - Since svc_rdma_fastreg() had an error fast_reg_read_chunks() bails
      and the client seems to then hang.

I'd ask the InfiniBand guys: what do IB_WC_WR_FLUSH_ERR and
IB_WC_REM_ACCESS_ERR mean?  Is it something drastic that should result
in hangs?

nog.

> Both client and server are running the latest vanilla 2.6.34.1 kernel
> with Mellanox Connect-X infiniband cards.  If more information is
> required, please do ask.
>
> BTW: I can reproduce the problem quite reliably by running the bonnie++
> benchmark on the NFS mounted filesystem.
>
> nog.
>
> ps: I'm not subscribed to the list, please CC me on all replies.


[PATCH/RESEND] mlx4_core: module param to limit msix vec allocation

2010-07-14 Thread Arthur Kepner

The mlx4_core driver allocates 'nreq' msix vectors (and irqs), 
where:

  nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
               num_possible_cpus() + 1);

ConnectX HCAs support 512 event queues (4 reserved). On a system 
with enough processors, we get:

  mlx4_core 0006:01:00.0: Requested 508 vectors, but only 256 MSI-X vectors available, trying again

Further attempts (by other drivers) to allocate interrupts fail, 
because mlx4_core got 'em all.

How about this?

Signed-off-by: Arthur Kepner akep...@sgi.com

---

 main.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index e3e0d54..0a316d0 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -68,6 +68,10 @@ static int msi_x = 1;
 module_param(msi_x, int, 0444);
 MODULE_PARM_DESC(msi_x, "attempt to use MSI-X if nonzero");
 
+static int max_msi_x_vec = 64;
+module_param(max_msi_x_vec, int, 0444);
+MODULE_PARM_DESC(max_msi_x_vec, "max MSI-X vectors we'll attempt to allocate");
+
 #else /* CONFIG_PCI_MSI */
 
 #define msi_x (0)
@@ -968,8 +972,10 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int i;
 
 	if (msi_x) {
+		nreq = min_t(int, num_possible_cpus() + 1, max_msi_x_vec);
 		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     num_possible_cpus() + 1);
+			     nreq);
+
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
 			goto no_msi;
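
The arithmetic behind the "Requested 508 vectors" message, and the effect of
the proposed cap, can be checked in user space. This is a sketch that mirrors
the two min_t() expressions with plain integers. The helper names are
hypothetical, and the CPU counts used when calling them are assumptions,
since the message only says "enough processors".

```c
static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Vectors requested by the existing code. */
static int nreq_uncapped(int num_eqs, int reserved_eqs, int ncpus)
{
	return min_int(num_eqs - reserved_eqs, ncpus + 1);
}

/* Vectors requested with the proposed max_msi_x_vec limit applied. */
static int nreq_capped(int num_eqs, int reserved_eqs, int ncpus, int max_vec)
{
	return min_int(num_eqs - reserved_eqs, min_int(ncpus + 1, max_vec));
}
```

With 512 event queues, 4 of them reserved, any machine with at least 507
possible CPUs requests 512 - 4 = 508 vectors today. With the default cap of
64 from the patch the request drops to 64, while a small machine is
unaffected because ncpus + 1 is already below the cap.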



RE: IB perf test questions

2010-07-14 Thread Boris Shpolyansky
Regarding the BW numbers: they look reasonable. You can try to improve the BW 
up to 3.2GB/s by increasing the PCIe MTU from 128 to 256 bytes. This usually 
requires BIOS configuration changes.

Boris Shpolyansky
Sr. Member of Technical Staff, Applications
 
Mellanox Technologies Inc.
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085
Tel.: (408) 916 0014
Fax: (408) 585 0314
Cell: (408) 834 9365
www.mellanox.com
Mellanox on Twitter and Facebook

-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Wednesday, July 14, 2010 2:17 PM
To: Tom Ammon
Cc: linux-rdma@vger.kernel.org
Subject: Re: IB perf test questions

On Wed, Jul 14, 2010 at 02:28:15PM -0600, Tom Ammon wrote:

> Also, whether I use ib_read_bw or ib_write_bw, the machine I initiate
> the test from (in this case taildrop) shows one of its CPU cores
> pegged at 100% for the duration of the test, but I see no CPU
> utilization at all on the receiving node. Can someone explain to me
> what's going on under the hood, here? I would think that read_bw would
> load up the sending host but that write_bw would load up the receiving
> host (or maybe vice versa), so this seems counterintuitive to me. when I
> use the -b flag to do a bidirectional test, a single CPU core on both
> machines pegs at 100%.

In all cases the master machine sits in a CPU bound loop waiting for
completions so it can issue more RDMA operations. The difference
between write and read is simply the RDMA op that is issued.

The slave side just sits there and the NIC does all the work.

Bi-directional mode runs a master operation on both sides..

Jason