from:"Mike Heinz"

[ewg] Generating debuginfo for a build of OFED?

2012-02-08 Thread Mike Heinz

I'm trying to track down a problem by using systemtap - but it needs the 
debuginfo for the affected modules, and the OFED installer does not create a 
debuginfo for the kernel modules.

Is there a way to turn the creation of debuginfo files on?

This message and any attached documents contain information from QLogic 
Corporation or its wholly-owned subsidiaries that may be confidential. If you 
are not the intended recipient, you may not read, copy, distribute, or use this 
information. If you have received this transmission in error, please notify the 
sender immediately by reply e-mail and then delete this message.

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] ibdiagpath broken with TCL 8.5

2011-03-03 Thread Mike Heinz

If I get a chance, I'll take a look and see if I can find an easy fix. One simple 
thing that occurred to me was to modify ibdebug.tcl to filter the field names 
out of the output string, but I'm not sure what the side effects would be.
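
The filtering idea can be pictured roughly as follows. This is only a sketch in Python, not the actual ibdebug.tcl change: `strip_field_name` is a hypothetical helper, and it assumes the dump output holds a single field (as with smVlArbTableMad or smLftBlockMad). Multi-field output such as smSwitchInfoMad's would need different handling.

```python
def strip_field_name(dump_output):
    """Return dump values with one leading '-field' token removed.

    Hypothetical helper for illustration; not part of ibis or ibdebug.tcl.
    Intended only for single-field dumps.
    """
    tokens = dump_output.split(None, 1)
    # Treat the first token as a field name only if it is a dash followed
    # by a letter, so a leading negative number is left alone.
    if tokens and len(tokens[0]) > 1 and tokens[0][0] == "-" and tokens[0][1].isalpha():
        return tokens[1].strip() if len(tokens) > 1 else ""
    return dump_output.strip()


with_name = "-vl_entry {0x0 0x00} {0x0 0x00}"     # shape seen with TCL 8.5
without_name = "{0x0 0x00} {0x0 0x00}"            # shape seen with TCL 8.4
print(strip_field_name(with_name) == strip_field_name(without_name))
```

Normalizing at the point where the string is consumed would leave ibis untouched, but any caller that relies on the field name being present would then see different data, which is the side effect in question.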

-Original Message-
From: Yevgeny Kliteynik [mailto:klit...@dev.mellanox.co.il]
Sent: Thursday, March 03, 2011 5:45 AM
To: Mike Heinz
Cc: Linux RDMA; ewg@lists.openfabrics.org; Todd Rimmer
Subject: Re: ibdiagpath broken with TCL 8.5

Mike,

On 01-Mar-11 11:13 PM, Mike Heinz wrote:
> YK,
>
> I had a chance to go back and dig further into this. I just scratch-built the 
> ibis executable on a RHEL6 system and started running it in interactive 
> mode. What I see is that results that return arrays are getting garbage 
> prepended to them. It looks like the root problem, the one John tried to patch 
> last fall and the one causing problems for some of my systems here, is that 
> ibis isn't interfacing with TCL 8.5 correctly:
>
> % puts [smLftBlockMad dump]
> -lft 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> % puts [smVlArbTableMad dump]
> -vl_entry {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00}
> {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00}
>
> I do not see this behavior on systems running TCL 8.4:
>
> % ibis_init
> 0
> % ibis_set_port 0x00066a00a000707f
> 0
> % puts [smLftBlockMad dump]
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> % puts [smVlArbTableMad dump]
> {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0
> 0x00} {0x0 0x00}

Interesting. I tried it, and I see the same results as you.
It looks like "dump" is supposed to include field names only if there is more 
than one field in the object.

With TCL 8.4, I see this:

% smVlArbTableMad dump
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00}
% smSwitchInfoMad dump
-lin_cap 0 -rand_cap 0 -mcast_cap 0 -lin_top 0 -def_port 0 -def_mcast_pri_port 0 
-def_mcast_not_port 0 -life_state 0 -lids_per_port 0 -enforce_cap 0 -flags 0

So the VLArb Table doesn't have a field name, while SwitchInfo has all its 
fields. I see similar behavior with other objects.
Ibis has an implementation of the dump function for "non-trivial" objects 
(objects that are not just a set of standard data types). VLArbTable would be 
one of them: it consists of VLArbTable elements, which have their own dump 
function:

%typemap(tcl8, out) ib_vl_arb_element_t[ANY] {
    int i;
    char buff[16];
    for (i = 0; i < $dim0; i++) {
        sprintf(buff, "{0x%x 0x%02x} ", $source[i].vl, $source[i].weight);
        Tcl_AppendResult(interp, buff, NULL);
    }
}

typedef struct _ibsm_vl_arb_table
{
    ib_vl_arb_element_t vl_entry[IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK];
} smVlArbTable;
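
To make the typemap's effect concrete, here is a rough Python rendering of the string it builds: one `{vl weight}` pair per element, each followed by a space. `format_vl_arb` is a made-up name for illustration, and the 32-entry table size is taken from the dump output shown in this thread, not from the header that defines IB_NUM_VL_ARB_ELEMENTS_IN_BLOCK.

```python
def format_vl_arb(entries):
    """Mimic the typemap loop: append '{0x%x 0x%02x} ' per (vl, weight) pair."""
    out = ""
    for vl, weight in entries:
        # Same format string as the sprintf() call in the typemap.
        out += "{0x%x 0x%02x} " % (vl, weight)
    return out


table = [(0, 0)] * 32          # assumed block size, per the dumps in this thread
text = format_vl_arb(table)
print(text.count("{"))         # number of elements emitted
print(text[:11])               # first element, including its trailing space
```

Note that the typemap itself emits only the values; any `-vl_entry` prefix seen in the dump output is added elsewhere, outside this loop.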

It looks like this behavior was changed in TCL 8.5.
IMHO, the TCL 8.5 behavior seems more consistent.
However, it is clear that in order to support both 8.5 and older versions, that 
simple patch is not enough.
Also, this new behavior will probably break any TCL script that was relying on 
the old ibis output...

If I'm right, then you will see this problem also with smPkeyTableMad, 
smGuidInfoMad, smVlArbTableMad, smSlVlTableMad, smMftBlockMad, and 
smLftBlockMad MADs.
And that's only SM MADs. There are also SA, CC, and others.

Bottom line, I'm reverting the fix to allow ibdiagpath work on all the distros 
wi

[ewg] ibdiagpath broken with TCL 8.5

2011-03-01 Thread Mike Heinz

YK,

I had a chance to go back and dig further into this. I just scratch-built the 
ibis executable on a RHEL6 system and started running it in interactive mode. 
What I see is that results that return arrays are getting garbage prepended to 
them. It looks like the root problem, the one John tried to patch last fall and 
the one causing problems for some of my systems here, is that ibis isn't 
interfacing with TCL 8.5 correctly:

% puts [smLftBlockMad dump]
-lft 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00
% puts [smVlArbTableMad dump]
-vl_entry {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00}

I do not see this behavior on systems running TCL 8.4:

% ibis_init
0
% ibis_set_port 0x00066a00a000707f
0
% puts [smLftBlockMad dump]
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
% puts [smVlArbTableMad dump]
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00} 
{0x0 0x00} {0x0 0x00} {0x0 0x00} {0x0 0x00}

> -Original Message-
> From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
> boun...@lists.openfabrics.org] On Behalf Of Mike Heinz
> Sent: Monday, February 21, 2011 11:55 AM
> To: klit...@dev.mellanox.co.il
> Cc: Linux RDMA; ewg@lists.openfabrics.org
> Subject: Re: [ewg] Patch breaks OFED 1.5.3: [PATCH] ibdiagpath:
> Properly index VlArbTable during QoS test
>
> YK,
>
> I just finished running an RC4 build on Redhat 6. I didn't get the same
> error - but ibdiagpath still failed:
>
> [root@ifs004 1]# ibdiagpath -l 0x1,0x2
> Loading IBDIAGPATH from: /usr/lib64/ibdiagpath1.5.6
> -W- Topology file is not specified.
> Reports regarding cluster links will use direct routes.
> Loading IBDM from: /usr/lib64/ibdm1.5.6
> -I- Using port 1 as the local port.
>
> -I---
> -I- Traversing the path from local to source
> -I---
>
> -I---
> -I- Traversing the path from source to destination
> -I---
> -I- From: lid=0x0001 guid=0x00117578aca6 dev=29474 ifs004/P1
> -I- To:   lid=0x0003 guid=0x00066a01e5000108 dev=29472 Port=8
>
> -I- From: lid=0x0003 guid=0x00066a01e5000108 dev=29472 Port=8
> -I- To:   lid=0x0001 guid=0x00117578aca6 dev=29474 ifs004/P1
>
> can't read "PATH(1)": no such element in array
> [root@ifs004 1]#
>
>
> The problem appears to be occurring in this code fragment:
>
> if {[info exists NODE]} {
>     for {set i 0} {$i < [llength [array names NODE *,PortGUID]]} {incr i} {
>         set portGuid $NODE($i,PortGUID)
>         set nodeGuid $G(data:NodeGuid.$portGuid)
>         if {$i % 2} {
>             set portNum $NODE($i,EntryPort)
>         } else {
>             set portNum [lindex [split $PATH([expr $i + 1]) ,] end] ;# <-- Bug here. Line 2381, ibdebug_if.tcl
>         }
>         lappend CSV_ERRORS $CSV_scope,$nodeGuid,$portGuid,$portNum,$desc,$msgBody,$CSV_severity,$exid,$err_type
>     }
> } else {
>     lappend CSV_ERRORS $CSV_scope,$nodeGuid,$portGuid,$portNum,$desc,$msgBody,$CSV_severity,$exid,$err_type
> }
> }
>
> I don't know if it matters, but I'm testing with a one-port HCA. I
> added a puts in the offending code and got this:
>
> MHEINZ: i = 0. PATH(0) = 1
> can't read "PATH(1)": no such element in array
>
> Please let me know if there are any tests I can run for you.
>
> -Original Message-
> From: Mike Heinz
> Sent: Monday, February 21, 2011 10:40 AM
> To: 'klit...@dev.mell

Re: [ewg] Patch breaks OFED 1.5.3: [PATCH] ibdiagpath: Properly index VlArbTable during QoS test

2011-02-21 Thread Mike Heinz

YK,

I just finished running an RC4 build on Redhat 6. I didn't get the same error - 
but ibdiagpath still failed:

[root@ifs004 1]# ibdiagpath -l 0x1,0x2
Loading IBDIAGPATH from: /usr/lib64/ibdiagpath1.5.6
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.5.6
-I- Using port 1 as the local port.

-I---
-I- Traversing the path from local to source
-I---

-I---
-I- Traversing the path from source to destination
-I---
-I- From: lid=0x0001 guid=0x00117578aca6 dev=29474 ifs004/P1
-I- To:   lid=0x0003 guid=0x00066a01e5000108 dev=29472 Port=8

-I- From: lid=0x0003 guid=0x00066a01e5000108 dev=29472 Port=8
-I- To:   lid=0x0001 guid=0x00117578aca6 dev=29474 ifs004/P1

can't read "PATH(1)": no such element in array
[root@ifs004 1]#


The problem appears to be occurring in this code fragment:

if {[info exists NODE]} {
    for {set i 0} {$i < [llength [array names NODE *,PortGUID]]} {incr i} {
        set portGuid $NODE($i,PortGUID)
        set nodeGuid $G(data:NodeGuid.$portGuid)
        if {$i % 2} {
            set portNum $NODE($i,EntryPort)
        } else {
            set portNum [lindex [split $PATH([expr $i + 1]) ,] end] ;# <-- Bug here. Line 2381, ibdebug_if.tcl
        }
        lappend CSV_ERRORS $CSV_scope,$nodeGuid,$portGuid,$portNum,$desc,$msgBody,$CSV_severity,$exid,$err_type
    }
} else {
    lappend CSV_ERRORS $CSV_scope,$nodeGuid,$portGuid,$portNum,$desc,$msgBody,$CSV_severity,$exid,$err_type
}
}

I don't know if it matters, but I'm testing with a one-port HCA. I added a puts 
in the offending code and got this:

MHEINZ: i = 0. PATH(0) = 1
can't read "PATH(1)": no such element in array
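
The failure mode can be modelled outside of ibdiag. The sketch below uses a Python dict in place of the Tcl array and assumes only what the debug print shows: PATH holds a single element at index 0, so a lookup at index i + 1 has nothing to find. `entry_port` is a hypothetical helper showing a guarded lookup. It is not the fix that went into ibdebug_if.tcl.

```python
# Mirrors the debug output: PATH(0) = 1 and there is no PATH(1).
PATH = {0: "1"}


def entry_port(path, i):
    """Return the last comma-separated item of path[i + 1], or None if absent."""
    hop = path.get(i + 1)
    if hop is None:
        return None
    # Equivalent of [lindex [split $PATH(...) ,] end] in the Tcl fragment.
    return hop.split(",")[-1]


try:
    PATH[0 + 1]                      # unguarded access, as in the Tcl code
except KeyError:
    print("no such element")

print(entry_port(PATH, 0))           # guarded access reports the gap instead
```

Whether a missing next hop should be skipped, reported, or treated as the end of the route is a design decision for the ibdiag maintainers. The sketch only shows where the unguarded index goes wrong.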

Please let me know if there are any tests I can run for you.

-Original Message-
From: Mike Heinz
Sent: Monday, February 21, 2011 10:40 AM
To: 'klit...@dev.mellanox.co.il'; John Jolly
Cc: ewg@lists.openfabrics.org; Linux RDMA; Todd Rimmer; Eli Dorfman (Voltaire)
Subject: RE: Patch breaks OFED 1.5.3: [ewg] [PATCH] ibdiagpath: Properly index 
VlArbTable during QoS test

Yevgeny,

It did occur to me that this is a version issue; I tested with TCL 8.4, which 
is the version included in RHEL5 and SLES10. The newest version appears to be 
8.5. Skimming through the release notes, I didn't see anything about language 
changes, but if it's working for you then obviously the language has 
changed.

The thing is, I also noticed that John's original complaint - about an extra 
item in the array - did not seem to be true on the RHEL 5.x boxes I tried, 
which is why I suggested that the entire change should be rolled back.

I'm building RC4 on a Red Hat 6 box now, I'll see if it makes a difference.

-Original Message-
From: Yevgeny Kliteynik [mailto:klit...@dev.mellanox.co.il]
Sent: Sunday, February 20, 2011 9:05 AM
To: Mike Heinz; John Jolly
Cc: ewg@lists.openfabrics.org; Linux RDMA; Todd Rimmer; Eli Dorfman (Voltaire)
Subject: Re: Patch breaks OFED 1.5.3: [ewg] [PATCH] ibdiagpath: Properly index 
VlArbTable during QoS test

Mike,

This looks like an issue of different tcl versions/implementations.

I certainly can replace "$i+1" with "[expr $i+1]", but I'm not
sure about reverting the patch.

John,

What tcl version have you used?

-- YK



On 07-Feb-11 6:44 PM, Mike Heinz wrote:
> The version of  ibdiagpath included with OFED 1.5.3-rc3 contains syntax 
> errors which prevent it from executing on the systems I've tested (using TCL 
> 8.4).  Attempts to use ibdiagpath fail with an error message:
>
>> -I---
>> -I- QoS on Path Check
>> -I---
>> bad index "0+1": must be integer or end?-integer?
>
> After doing some research and debugging, I traced the problem to a patch 
> applied back in October:
>
> commit f3cf1f7c15ca24598fdf68b9ba71788b386b2f14
> Author: John Jolly
> Date:   Wed Oct 6 17:29:48 2010 +0200
>
>  ibdiagpath: Properly index VlArbTable during QoS test
>
>  Description: ibdiagpath: Properly index VlArbTable during QoS test
>  Symptom: Error 'invalid bareword "vl_entry"' during "QoS on
>   Path Check"
>  Problem: The 'dump' command within the smVlArbTableMad command
>   appends '-vl_entry' to the beginning of the array.
>   The ibdebug.tcl script does not properly handle this
>

Re: [ewg] Patch breaks OFED 1.5.3: [PATCH] ibdiagpath: Properly index VlArbTable during QoS test

2011-02-21 Thread Mike Heinz

Yevgeny,

It did occur to me that this is a version issue; I tested with TCL 8.4, which 
is the version included in RHEL5 and SLES10. The newest version appears to be 
8.5. Skimming through the release notes, I didn't see anything about language 
changes, but if it's working for you then obviously the language has 
changed.

The thing is, I also noticed that John's original complaint - about an extra 
item in the array - did not seem to be true on the RHEL 5.x boxes I tried, 
which is why I suggested that the entire change should be rolled back.

I'm building RC4 on a Red Hat 6 box now, I'll see if it makes a difference.

-Original Message-
From: Yevgeny Kliteynik [mailto:klit...@dev.mellanox.co.il]
Sent: Sunday, February 20, 2011 9:05 AM
To: Mike Heinz; John Jolly
Cc: ewg@lists.openfabrics.org; Linux RDMA; Todd Rimmer; Eli Dorfman (Voltaire)
Subject: Re: Patch breaks OFED 1.5.3: [ewg] [PATCH] ibdiagpath: Properly index 
VlArbTable during QoS test

Mike,

This looks like an issue of different tcl versions/implementations.

I certainly can replace "$i+1" with "[expr $i+1]", but I'm not
sure about reverting the patch.

John,

What tcl version have you used?

-- YK



On 07-Feb-11 6:44 PM, Mike Heinz wrote:
> The version of  ibdiagpath included with OFED 1.5.3-rc3 contains syntax 
> errors which prevent it from executing on the systems I've tested (using TCL 
> 8.4).  Attempts to use ibdiagpath fail with an error message:
>
>> -I---
>> -I- QoS on Path Check
>> -I---
>> bad index "0+1": must be integer or end?-integer?
>
> After doing some research and debugging, I traced the problem to a patch 
> applied back in October:
>
> commit f3cf1f7c15ca24598fdf68b9ba71788b386b2f14
> Author: John Jolly
> Date:   Wed Oct 6 17:29:48 2010 +0200
>
>  ibdiagpath: Properly index VlArbTable during QoS test
>
>  Description: ibdiagpath: Properly index VlArbTable during QoS test
>  Symptom: Error 'invalid bareword "vl_entry"' during "QoS on
>   Path Check"
>  Problem: The 'dump' command within the smVlArbTableMad command
>   appends '-vl_entry' to the beginning of the array.
>   The ibdebug.tcl script does not properly handle this
>   extra element at the beginning of the array.
>  Solution:Offset the index value by one when referencing the
>   array.
>
>  Signed-off-by: John Jolly
>  Signed-off-by: Yevgeny Kliteynik
>
> Unfortunately, this patch isn't valid TCL code (at least not in TCL 8.4) and 
> does not appear to be needed at all.
>
> For example:
>
>> set entry [lindex $values $i+1]
>
> Is not syntactically correct TCL.  In order for it to be correct it would 
> have to be
>
>> set entry [lindex $values [expr $i+1]]
>
> However, the patch does not appear to be needed at all. Reverting the patch, 
> allows ibdiagpath to complete successfully:
>
>> -I---
>> -I- QoS on Path Check
>> -I---
>> -W- Blocked VLs:3 4 5 at node:homer lid=0x0002 guid=0x00066a00a000707f 
>> dev=25208>  port:1
>> -W- SLs:3 4 5 6 7 8 9 10 11 12 13 14 15 are blocked due to VLArb node:homer
>>  lid=0x0002 guid=0x00066a00a000707f dev=25208 in-port:0 out-port:1
>> -W- Blocked VLs:3 4 5 at node: lid=0x0001 guid=0x00066a00d9000275 dev=47396
>>  port:21
>> -W- SLs:3 4 5 6 7 8 9 10 11 12 13 14 15 mapped to VL>  5 at node: lid=0x0001
>>  guid=0x00066a00d9000275 dev=47396 in-port:14 out-port:21
>> -I- The following SLs can be used:0 1 2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




Re: [ewg] OFA Management maintainership

2011-02-10 Thread Mike Heinz

Good luck with the change, Sasha.

From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Sasha Khapyorsky
Sent: Wednesday, February 09, 2011 2:09 PM
To: linux-rdma
Cc: EWG
Subject: [ewg] OFA Management maintainership

Hi,

I'm finishing my work for Voltaire these days and wish to transfer
my role as OFA management packages maintainer to
Alex Netes <ale...@voltaire.com>, whom I have known for many 
years as a great,
experienced engineer and a very good and positive person.

So starting from today his trees should be considered as master
development trees:

git://git.openfabrics.org/~alexnetes/libibumad

git://git.openfabrics.org/~alexnetes/opensm

git://git.openfabrics.org/~alexnetes/libibmad

git://git.openfabrics.org/~alexnetes/infiniband-diags

git://git.openfabrics.org/~alexnetes/ibsim

It is also likely that in the near future maintainership of
libibumad and infiniband-diags will be taken over by
Ira Weiny <wei...@llnl.gov>.

I would like to wish Alex and Ira a lot of success in their roles.

I would also like to thank the whole community for the good working time.

I will still be reachable at my email address 
<sashakv...@gmail.com>, so feel
free to contact me in case of any questions.

Sasha


[ewg] Patch breaks OFED 1.5.3: [PATCH] ibdiagpath: Properly index VlArbTable during QoS test

2011-02-07 Thread Mike Heinz

The version of ibdiagpath included with OFED 1.5.3-rc3 contains syntax errors 
which prevent it from executing on the systems I've tested (using TCL 8.4). 
Attempts to use ibdiagpath fail with an error message:

> -I---
> -I- QoS on Path Check
> -I---
> bad index "0+1": must be integer or end?-integer?

After doing some research and debugging, I traced the problem to a patch 
applied back in October:

commit f3cf1f7c15ca24598fdf68b9ba71788b386b2f14
Author: John Jolly 
Date:   Wed Oct 6 17:29:48 2010 +0200

ibdiagpath: Properly index VlArbTable during QoS test

Description: ibdiagpath: Properly index VlArbTable during QoS test
Symptom: Error 'invalid bareword "vl_entry"' during "QoS on
 Path Check"
Problem: The 'dump' command within the smVlArbTableMad command
 appends '-vl_entry' to the beginning of the array.
 The ibdebug.tcl script does not properly handle this
 extra element at the beginning of the array.
Solution:Offset the index value by one when referencing the
 array.

Signed-off-by: John Jolly 
Signed-off-by: Yevgeny Kliteynik 

Unfortunately, this patch isn't valid TCL code (at least not in TCL 8.4) and 
does not appear to be needed at all.

For example:

> set entry [lindex $values $i+1]

Is not syntactically correct TCL.  In order for it to be correct it would have 
to be

> set entry [lindex $values [expr $i+1]]
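
The difference between the two forms comes down to which index strings lindex accepts. The error text above shows TCL 8.4 allowing only an integer or end?-integer?. As far as I can tell from the Tcl documentation, 8.5 additionally accepts simple integer+integer and integer-integer forms, which would explain why "$i+1" works on some systems and not others. The Python sketch below models the two rule sets with regular expressions. The patterns are my simplification for illustration, not Tcl's actual parser.

```python
import re

# Simplified models of the index strings lindex accepts (illustrative only).
TCL84_INDEX = re.compile(r"^(?:-?\d+|end(?:-\d+)?)$")
TCL85_INDEX = re.compile(r"^(?:-?\d+(?:[+-]\d+)?|end(?:[+-]\d+)?)$")


def index_ok(index, version):
    """Return True if the index string is accepted under the modelled rules."""
    pattern = TCL84_INDEX if version == "8.4" else TCL85_INDEX
    return bool(pattern.match(index))


# "$i+1" with i = 0 reaches lindex as the literal string "0+1".
for version in ("8.4", "8.5"):
    print(version, index_ok("0+1", version))
```

Wrapping the arithmetic in [expr ...] sidesteps the question, because lindex then receives a plain integer under either version.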

However, the patch does not appear to be needed at all. Reverting the patch 
allows ibdiagpath to complete successfully:

> -I---
> -I- QoS on Path Check
> -I---
> -W- Blocked VLs:3 4 5 at node:homer lid=0x0002 guid=0x00066a00a000707f 
> dev=25208 port:1
> -W- SLs:3 4 5 6 7 8 9 10 11 12 13 14 15 are blocked due to VLArb node:homer
> lid=0x0002 guid=0x00066a00a000707f dev=25208 in-port:0 out-port:1
> -W- Blocked VLs:3 4 5 at node: lid=0x0001 guid=0x00066a00d9000275 dev=47396
> port:21
> -W- SLs:3 4 5 6 7 8 9 10 11 12 13 14 15 mapped to VL 5 at node: lid=0x0001
> guid=0x00066a00d9000275 dev=47396 in-port:14 out-port:21
> -I- The following SLs can be used:0 1 2


[ewg] [PATCH] umad_send.3 (man page)

2011-02-07 Thread Mike Heinz

The man page for umad_send() does not match the source code.

Signed-off-by: Michael Heinz michael.he...@qlogic.com
---
diff --git a/libibumad/man/umad_send.3 b/libibumad/man/umad_send.3
index 2d84f57..c4a617a 100644
--- a/libibumad/man/umad_send.3
+++ b/libibumad/man/umad_send.3
@@ -7,11 +7,13 @@ umad_send \- send umad
 .nf
 .B #include 
 .sp
-.BI "int umad_send(int " "portid" ", int " "agentid" ", void " "*umad" ", int " "timeout_ms" ", int " "retries");
+.BI "int umad_send(int " "portid" ", int " "agentid" ", void " "*umad" ", int " "length" ", int " "timeout_ms" ", int " "retries");
 .fi
 .SH "DESCRIPTION"
 .B umad_send()
-sends the specified
+sends
+.I length\fR
+bytes from the specified
 .I umad\fR
 buffer from the port specified by
 .I portid\fR,


Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG

2011-02-02 Thread Mike Heinz

It was discussed in the Linux-RDMA list for many months. You can find a list of 
the archived messages here:

http://www.mail-archive.com/search?q=SA+Busy&l=linux-r...@vger.kernel.org

The most recent version of the patch is here:

http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg06644.html

Basically, the spec permits an SM to reply "busy" instead of simply tossing 
packets on the floor, but OFED does not handle this case right now.

-Original Message-
From: Moni Shoua [mailto:mo...@voltaire.com]
Sent: Wednesday, February 02, 2011 10:42 AM
To: Mike Heinz
Cc: Vlad; n...@voltaire.com; linux-r...@vger.kernel.org; ewg
Subject: Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave 
an MCG

Mike Heinz wrote:
> Wouldn't the BUSY patch I proposed last year deal with this situation?
Can you please send a link to this patch?

>
> -Original Message-
> From: ewg-boun...@lists.openfabrics.org 
> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Moni Shoua
> Sent: Wednesday, February 02, 2011 10:10 AM
> To: Vlad
> Cc: n...@voltaire.com; ewg
> Subject: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an 
> MCG
>
> This patch helps when the SM is busy and an MC group is therefore left joined
> while the host believes that it was left.
>
> Note: the patch below is not to drivers/infiniband/core but it generates
> a patch under kernel_patches/fixes.
>
> Index: 
> ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
> ===
> --- /dev/null   1970-01-01 00:00:00.0 +
> +++ 
> ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
>  2011-02-02 16:52:02.0 +0200
> @@ -0,0 +1,46 @@
> +Add a multicast leave maximum retry setting in 
> sys/module/ib_sa/parameters/mcast_leave_retries.
> +Add a debug print when the maximum retry count is reached.
> +
> +Signed-off-by: Nir Muchtar 
> +Reviewed-by:   Moni Shoua  
> +--
> +
> +Index: ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c
> +===
> +--- ofa_kernel-1.5.2.orig/drivers/infiniband/core/multicast.c  2010-08-17 
> 12:56:06.0 +0300
>  ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c   2010-08-17 
> 13:15:38.0 +0300
> +@@ -40,6 +40,12 @@
> + #include 
> + #include "sa.h"
> +
> ++static int mcast_leave_retries = 3;
> ++
> ++module_param_call(mcast_leave_retries, param_set_int, param_get_int,
> ++&mcast_leave_retries, 0644);
> ++MODULE_PARM_DESC(mcast_leave_retries, "Number of retries for multicast 
> leave requests before giving up");
> ++
> + static void mcast_add_one(struct ib_device *device);
> + static void mcast_remove_one(struct ib_device *device);
> +
> +@@ -520,8 +526,11 @@
> +   if (status && (group->retries > 0) &&
> +   !send_leave(group, group->leave_state))
> +   group->retries--;
> +-  else
> ++  else {
> ++  if (status && group->retries <= 0)
> ++  printk("reached max retry count. status=%d  .Giving 
> up\n", status);
> +   mcast_work_handler(&group->work);
> ++  }
> + }
> +
> + static struct mcast_group *acquire_group(struct mcast_port *port,
> +@@ -544,7 +553,7 @@
> +   if (!group)
> +   return NULL;
> +
> +-  group->retries = 3;
> ++  group->retries = mcast_leave_retries;
> +   group->port = port;
> +   group->rec.mgid = *mgid;
> +   group->pkey_index = MCAST_INVALID_PKEY_INDEX;




Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG

2011-02-02 Thread Mike Heinz

Wouldn't the BUSY patch I proposed last year deal with this situation?

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Moni Shoua
Sent: Wednesday, February 02, 2011 10:10 AM
To: Vlad
Cc: n...@voltaire.com; ewg
Subject: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG

This patch helps when the SM is busy and an MC group is therefore left joined
while the host believes that it was left.

Note: the patch below is not to drivers/infiniband/core but it generates
a patch under kernel_patches/fixes.

Index: 
ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ 
ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch 
2011-02-02 16:52:02.0 +0200
@@ -0,0 +1,46 @@
+Add a multicast leave maximum retry setting in sys/module/ib_sa/parameters/mcast_leave_retries.
+Add a debug print when the maximum retry count is reached.
+
+Signed-off-by: Nir Muchtar 
+Reviewed-by:   Moni Shoua  
+--
+
+Index: ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c
+===
+--- ofa_kernel-1.5.2.orig/drivers/infiniband/core/multicast.c  2010-08-17 12:56:06.0 +0300
++++ ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c   2010-08-17 13:15:38.0 +0300
+@@ -40,6 +40,12 @@
+ #include 
+ #include "sa.h"
+
++static int mcast_leave_retries = 3;
++
++module_param_call(mcast_leave_retries, param_set_int, param_get_int,
++&mcast_leave_retries, 0644);
++MODULE_PARM_DESC(mcast_leave_retries, "Number of retries for multicast leave requests before giving up");
++
+ static void mcast_add_one(struct ib_device *device);
+ static void mcast_remove_one(struct ib_device *device);
+
+@@ -520,8 +526,11 @@
+   if (status && (group->retries > 0) &&
+   !send_leave(group, group->leave_state))
+   group->retries--;
+-  else
++  else {
++  if (status && group->retries <= 0)
++  printk("reached max retry count. status=%d  .Giving up\n", status);
+   mcast_work_handler(&group->work);
++  }
+ }
+
+ static struct mcast_group *acquire_group(struct mcast_port *port,
+@@ -544,7 +553,7 @@
+   if (!group)
+   return NULL;
+
+-  group->retries = 3;
++  group->retries = mcast_leave_retries;
+   group->port = port;
+   group->rec.mgid = *mgid;
+   group->pkey_index = MCAST_INVALID_PKEY_INDEX;
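
Read as ordinary control flow, the hunks above amount to a bounded retry: a failed leave is resent while the retry budget lasts, and once the budget is spent the failure is logged and normal group processing resumes. The Python model below is a sketch of that logic only. `leave_handler`, its return strings, and the dict standing in for the group structure are illustrative, with `send_leave` returning 0 on success as in the kernel code.

```python
MCAST_LEAVE_RETRIES = 3   # default of the module parameter added by the patch


def leave_handler(group, status, send_leave):
    """Model one completion of a leave request.

    Returns 'retry' if the leave was resent, 'gave_up' if the retry budget
    is exhausted, and 'done' otherwise.
    """
    if status and group["retries"] > 0 and send_leave(group) == 0:
        group["retries"] -= 1
        return "retry"
    if status and group["retries"] <= 0:
        return "gave_up"      # the point where the patch adds its printk()
    return "done"


group = {"retries": MCAST_LEAVE_RETRIES}
always_sent = lambda g: 0     # resend always succeeds; the SA keeps failing
print([leave_handler(group, 1, always_sent) for _ in range(4)])
```

With the default of 3, a persistently failing leave is resent three times before the give-up branch is reached.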

Re: [ewg] Need help for Infiniband optimisation for our cluster (MTU...)

2010-12-07 Thread Mike Heinz

Richard - that's odd, I don't see an "ibv_portstat" command on my boxes - do 
you know what package provides it?

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Richard Croucher
Sent: Tuesday, December 07, 2010 11:17 AM
To: 'giggzounet'; ewg@lists.openfabrics.org
Subject: Re: [ewg] Need help for Infiniband optimisation for our cluster 
(MTU...)

The InfiniBand standard allows an MTU of 4096 bytes but the HCA you are using
limits this to 2048. This can be set and queried using your SM management.

On the server side, the ibv_portstat command will show the current MTU size.

All the information you need is in the docs. 
There are also multiple parties, including myself, who offer training
workshops for this stuff.

Richard


-Original Message-
From: ewg-boun...@lists.openfabrics.org
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
Sent: 07 December 2010 15:32
To: ewg@lists.openfabrics.org
Subject: Re: [ewg] Need help for Infiniband optimisation for our cluster
(MTU...)

Hi,

Thx for your answer!

Particularly the explanation of the difference between connected and datagram
mode (I see that with the IMB1 benchmarks of mpi)!

The hardware we are using in detail:
- on the master: Mellanox MHGH18-XTC ConnectX with VPI adapter, single
port 20Gb/s, PCIe2.0 x8 2.5GT/s
- on the nodes: Integrated Mellanox DDR Infiniband 20Gbs ConnectX with
QSFP Connector.

How can I find out the limit of the MTU size?


On the Infiniband we are just using mpi with different CFD programs. But
always with mpi (intel mpi or openmpi). Should I use QoS?

Thx for your help!


Le 07/12/2010 16:09, Richard Croucher a écrit :
> Connected mode will provide more throughput but datagram mode will provide
> lower latency.
> You don't say what HCAs you are using.  Some of the optimizations for
> Connected mode are only available for the newer ConnectX QDR HCAs.
> 
> Your HCA will probably limit the MTU size.  Leave this as large as possible.
> 
> If you are only running a single application on the InfiniBand you need not
> bother with QoS.   If you are running multiple, then you do need to set
> this.  This is quite complex since you need to define VLs, their
> arbitration policies and assign SLs to them.  This is described in the
> OpenSM docs.  This is relevant even if you are using the embedded SM in the
> switch.
> 
> As a newbie, take a look in the ../OFED/docs
> There is probably all you need there. Mellanox also have some useful docs on
> their website.
> 
> -Original Message-
> From: ewg-boun...@lists.openfabrics.org
> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
> Sent: 07 December 2010 14:01
> To: ewg@lists.openfabrics.org
> Subject: [ewg] Need help for Infiniband optimisation for our cluster
> (MTU...)
> 
> Hi,
> 
> I'm new on this list. We have in our laboratory a little cluster:
> - master 8 cores
> - 8 nodes with 12 cores
> - DDR infiniband switch Mellanox MTS3600R
> 
> On these machines we have an oscar cluster with CentOS 5.5. We have
> installed the ofed packages 1.5.1. The default config for the infiniband
> is used. So infiniband is running in connected mode.
> 
> Our cluster is used to solve CFD (Computational Fluid Dynamics)
> problems. And I'm trying to optimize the infiniband network and so I
> have several questions:
> 
> - Is this the right mailing list to ask? (If not, where should I post?)
> 
> - Is there a how-to on infiniband optimisation?
> 
> - CFD computations need a lot of bandwidth. There is a lot of data
> exchange through MPI (we are using intel mpi). Does the infiniband mode
> (connected or datagram) have an influence in this case? What is the "best" MTU
> for those computations?
> 
> 
> Best regards,
> Guillaume
> 

Re: [ewg] Need help for Infiniband optimisation for our cluster (MTU...)

2010-12-07 Thread Mike Heinz

That's the ipoib value. It tells Linux it supports 64k packets, but they get 
broken up when they hit the wire. I spent a few minutes looking through stock 
ofed for a tool that displays the active mtu size, but the only tool I know of 
is part of QLogic's OFED+ stack.
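
To make the "broken up when they hit the wire" point concrete, here is a rough sketch of the arithmetic. It assumes the usual InfiniBand MTU encoding (code 1 = 256 bytes up to code 5 = 4096 bytes) and ignores all headers; the function names are made up for illustration and are not part of any OFED tool.

```c
#include <stdint.h>

/* IB MTU is carried as a small code: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
static uint32_t ib_mtu_code_to_bytes(unsigned code)
{
    if (code < 1 || code > 5)
        return 0;               /* not a valid IB MTU code */
    return 128u << code;        /* 1 -> 256 ... 5 -> 4096 */
}

/* Number of wire packets needed for one message of msg_bytes.
 * Payload only; packet headers are ignored in this sketch. */
static uint32_t packets_on_wire(uint32_t msg_bytes, unsigned mtu_code)
{
    uint32_t mtu = ib_mtu_code_to_bytes(mtu_code);

    if (mtu == 0)
        return 0;
    return (msg_bytes + mtu - 1) / mtu;   /* ceiling division */
}
```

With the 65520-byte ib0 value from this thread, a 2k fabric MTU (code 4) works out to 32 packets per full-size message, and a 4k MTU (code 5) to 16.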

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
Sent: Tuesday, December 07, 2010 10:59 AM
To: ewg@lists.openfabrics.org
Subject: Re: [ewg] Need help for Infiniband optimisation for our cluster 
(MTU...)

ok Thx for all your explanations.

So the MTU value which I'm seeing on the ib0 interface (65520) is not
related to the "real" infiniband MTU value?


Le 07/12/2010 16:52, Mike Heinz a écrit :
> Heh. I forgot Intel sells an mpi, I thought you were saying you had 
> recompiled one of the OFED mpis with icc.
> 
> 1) For your small cluster, there's no reason not to use connected mode. The 
> only reason for providing a datagram mode with MPI is to support very large 
> clusters where there simply aren't enough system resources for every node to 
> connect with every other node.
> 
> I would still suggest experimenting with mvapich-1 (and recompiling it with 
> icc) to see if you get better performance.
> 
> 2) Similarly, for a small cluster, QoS won't give you any benefit. The 
> purpose of QoS is to divide up the fabric's bandwidth so that multiple 
> simultaneous apps can share it in a controlled way. If you're only running 
> one app at a time (which seems likely) you want that app to get all available 
> bandwidth.
> 
> I'm not sure how you check the MTU size when using stock OFED, but my memory 
> for those HCAs is that they can use 2k MTUs. You only use a smaller MTU size 
> when the larger size causes reliability problems.
> 
> -Original Message-
> From: ewg-boun...@lists.openfabrics.org 
> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
> Sent: Tuesday, December 07, 2010 10:32 AM
> To: ewg@lists.openfabrics.org
> Subject: Re: [ewg] Need help for Infiniband optimisation for our cluster 
> (MTU...)
> 
> Hi,
> 
> Thx for your answer!
> 
> Particularly the explanation of the difference between connected and datagram
> mode (I see that with the IMB1 benchmarks of mpi)!
> 
> The hardware we are using in detail:
> - on the master: Mellanox MHGH18-XTC ConnectX with VPI adapter, single
> port 20Gb/s, PCIe2.0 x8 2.5GT/s
> - on the nodes: Integrated Mellanox DDR Infiniband 20Gbs ConnectX with
> QSFP Connector.
> 
> How can I find out the limit of the MTU size?
> 
> 
> On the Infiniband we are just using mpi with different CFD programs. But
> always with mpi (intel mpi or openmpi). Should I use QoS?
> 
> Thx for your help!
> 
> 
> Le 07/12/2010 16:09, Richard Croucher a écrit :
>> Connected mode will provide more throughput but datagram mode will provide
>> lower latency.
>> You don't say what HCAs you are using.  Some of the optimizations for
>> Connected mode are only available for the newer ConnectX QDR HCAs.
>>
>> Your HCA will probably limit the MTU size.  Leave this as large as possible.
>>
>> If you are only running a single application on the InfiniBand you need not
>> bother with QoS.   If you are running multiple, then you do need to set
>> this.  This is quite complex since you need to define VLs, their
>> arbitration policies and assign SLs to them.  This is described in the
>> OpenSM docs.  This is relevant even if you are using the embedded SM in the
>> switch.
>>
>> As a newbie, take a look in the ../OFED/docs
>> There is probably all you need there. Mellanox also have some useful docs on
>> their website.
>>
>> -Original Message-
>> From: ewg-boun...@lists.openfabrics.org
>> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
>> Sent: 07 December 2010 14:01
>> To: ewg@lists.openfabrics.org
>> Subject: [ewg] Need help for Infiniband optimisation for our cluster
>> (MTU...)
>>
>> Hi,
>>
>> I'm new on this list. We have in our laboratory a little cluster:
>> - master 8 cores
>> - 8 nodes with 12 cores
>> - DDR infiniband switch Mellanox MTS3600R
>>
>> On these machines we have an oscar cluster with CentOS 5.5. We have
>> installed the ofed packages 1.5.1. The default config for the infiniband
>> is used. So infiniband is running in connected mode.
>>
>> Our cluster is used to solve CFD (Computational Fluid Dynamics)
>> problems. And I'm trying to

Re: [ewg] Need help for Infiniband optimisation for our cluster (MTU...)

2010-12-07 Thread Mike Heinz

Heh. I forgot Intel sells an mpi, I thought you were saying you had recompiled 
one of the OFED mpis with icc.

1) For your small cluster, there's no reason not to use connected mode. The 
only reason for providing a datagram mode with MPI is to support very large 
clusters where there simply aren't enough system resources for every node to 
connect with every other node.
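
As a back-of-the-envelope illustration of why this only matters on very large clusters, the sketch below counts connections for a fully connected job. It assumes one MPI rank per core, which for the cluster described in this thread (8 master cores plus 8 nodes of 12 cores) gives 104 ranks; the function names are illustrative only.

```c
/* Peers each rank must hold a connection to in a fully connected job. */
static unsigned long connections_per_rank(unsigned long ranks)
{
    return ranks > 0 ? ranks - 1 : 0;
}

/* Total distinct rank-to-rank connections across the whole job. */
static unsigned long total_connections(unsigned long ranks)
{
    return ranks * connections_per_rank(ranks) / 2;
}
```

For 104 ranks that is 103 connections per rank and 5356 in total, which is small. The per-rank count grows linearly and the total quadratically with job size, which is where datagram mode starts to pay off.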

I would still suggest experimenting with mvapich-1 (and recompiling it with 
icc) to see if you get better performance.

2) Similarly, for a small cluster, QoS won't give you any benefit. The purpose 
of QoS is to divide up the fabric's bandwidth so that multiple simultaneous 
apps can share it in a controlled way. If you're only running one app at a time 
(which seems likely) you want that app to get all available bandwidth.

I'm not sure how you check the MTU size when using stock OFED, but my memory 
for those HCAs is that they can use 2k MTUs. You only use a smaller MTU size 
when the larger size causes reliability problems.

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
Sent: Tuesday, December 07, 2010 10:32 AM
To: ewg@lists.openfabrics.org
Subject: Re: [ewg] Need help for Infiniband optimisation for our cluster 
(MTU...)

Hi,

Thx for your answer!

Particularly the explanation of the difference between connected and datagram
mode (I see that with the IMB1 benchmarks of mpi)!

The hardware we are using in detail:
- on the master: Mellanox MHGH18-XTC ConnectX with VPI adapter, single
port 20Gb/s, PCIe2.0 x8 2.5GT/s
- on the nodes: Integrated Mellanox DDR Infiniband 20Gbs ConnectX with
QSFP Connector.

How can I find out the limit of the MTU size?


On the Infiniband we are just using mpi with different CFD programs. But
always with mpi (intel mpi or openmpi). Should I use QoS?

Thx for your help!


Le 07/12/2010 16:09, Richard Croucher a écrit :
> Connected mode will provide more throughput but datagram mode will provide
> lower latency.
> You don't say what HCAs you are using.  Some of the optimizations for
> Connected mode are only available for the newer ConnectX QDR HCAs.
> 
> Your HCA will probably limit the MTU size.  Leave this as large as possible.
> 
> If you are only running a single application on the InfiniBand you need not
> bother with QoS.   If you are running multiple, then you do need to set
> this.  This is quite complex since you need to define VLs, their
> arbitration policies and assign SLs to them.  This is described in the
> OpenSM docs.  This is relevant even if you are using the embedded SM in the
> switch.
> 
> As a newbie, take a look in the ../OFED/docs
> There is probably all you need there. Mellanox also have some useful docs on
> their website.
> 
> -Original Message-
> From: ewg-boun...@lists.openfabrics.org
> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
> Sent: 07 December 2010 14:01
> To: ewg@lists.openfabrics.org
> Subject: [ewg] Need help for Infiniband optimisation for our cluster
> (MTU...)
> 
> Hi,
> 
> I'm new on this list. We have in our laboratory a little cluster:
> - master 8 cores
> - 8 nodes with 12 cores
> - DDR infiniband switch Mellanox MTS3600R
> 
> On these machines we have an oscar cluster with CentOS 5.5. We have
> installed the ofed packages 1.5.1. The default config for the infiniband
> is used. So infiniband is running in connected mode.
> 
> Our cluster is used to solve CFD (Computational Fluid Dynamics)
> problems. And I'm trying to optimize the infiniband network and so I
> have several questions:
> 
> - Is this the right mailing list to ask? (If not, where should I post?)
> 
> - Is there a how-to on infiniband optimisation?
> 
> - CFD computations need a lot of bandwidth. There is a lot of data
> exchange through MPI (we are using intel mpi). Does the infiniband mode
> (connected or datagram) have an influence in this case? What is the "best" MTU
> for those computations?
> 
> 
> Best regards,
> Guillaume
> 

Re: [ewg] Need help for Infiniband optimisation for our cluster (MTU...)

2010-12-07 Thread Mike Heinz

When you say "connected mode" are you referring to ipoib or your MPI configuration? 
You really don't want to use ipoib for HPC applications. What MPI are you 
using?  

For MPI - my personal experience is that OpenMPI is sometimes more reliable but 
Mvapich-1 offers the best performance.

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of giggzounet
Sent: Tuesday, December 07, 2010 9:01 AM
To: ewg@lists.openfabrics.org
Subject: [ewg] Need help for Infiniband optimisation for our cluster (MTU...)

Hi,

I'm new on this list. We have in our laboratory a little cluster:
- master 8 cores
- 8 nodes with 12 cores
- DDR infiniband switch Mellanox MTS3600R

On these machines we have an oscar cluster with CentOS 5.5. We have
installed the ofed packages 1.5.1. The default config for the infiniband
is used. So infiniband is running in connected mode.

Our cluster is used to solve CFD (Computational Fluid Dynamics)
problems. And I'm trying to optimize the infiniband network and so I
have several questions:

- Is this the right mailing list to ask? (If not, where should I post?)

- Is there a how-to on infiniband optimisation?

- CFD computations need a lot of bandwidth. There is a lot of data
exchange through MPI (we are using intel mpi). Does the infiniband mode
(connected or datagram) have an influence in this case? What is the "best" MTU
for those computations?


Best regards,
Guillaume


Re: [ewg] user SA notifications, redux

2010-10-14 Thread Mike Heinz

By providing this mechanism, ib_usa provides a convenient and useful way for 
user applications to detect when nodes enter or leave the fabric. This is 
useful for fabric monitoring applications (like a system admin dashboard) or, 
conceivably, to allow a job scheduler to react to problems and to dynamically 
balance the computational load across the fabric.
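
As a small sketch of what such a monitoring application might do with a delivered notice, the code below classifies a trap number. It relies only on the trap numbers from the proposed API (64 for GID in service, 65 for GID out of service, carried in network byte order); the enum and function names here are hypothetical and not part of ib_usa.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs, htons */

/* Hypothetical classification a dashboard might use. */
enum fabric_change { NODE_JOINED, NODE_LEFT, OTHER_NOTICE };

/* trap_be: InformInfo trap number in network byte order. */
static enum fabric_change classify_trap(uint16_t trap_be)
{
    switch (ntohs(trap_be)) {
    case 64:                    /* GID in service */
        return NODE_JOINED;
    case 65:                    /* GID out of service */
        return NODE_LEFT;
    default:                    /* port state, key violations, etc. */
        return OTHER_NOTICE;
    }
}
```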

As a reminder - the original proposal also supported allowing user applications 
to join multicast groups, but there was talk about adding that to rdma_cm, 
instead.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Wednesday, October 13, 2010 11:59 AM
To: Mike Heinz; linux-r...@vger.kernel.org; e...@openfabrics.org
Cc: v...@mellanox.co.il; Roland Dreier
Subject: RE: user SA notifications, redux

> As I mentioned earlier, the reason ib_sa acts as a single access point for
> SA/SM traps and notices is because traps and notices are sent to ports, not
> to
> queue pairs and not to processes. That means only one entity can be
> subscribed
> for notices and traps at any particular time, and must manage them,
> "sharing
> them out" among all processes that are interested in them.

Can you provide a brief description of the intended usage model?


[ewg] user SA notifications, redux

2010-10-13 Thread Mike Heinz

Way back in May I proposed this prototype for adding SA notifications to the 
verbs API, but no one ever said yes or no. Vlad - I'm not even sure you were 
part of the conversation at the time, it was originally about a new API, but 
you should be aware of it since the conversation changed to adding the 
user-space capability to libibverbs.

Now that 1.5.2 is out the door, can we revisit this and try to get this and the 
matching kernel changes into the next release?

===

API for Proposal for adding ib_usa to the Linux Infiniband Subsystem
Mike Heinz
Mon, 24 May 2010 12:31:16 -0700

I spent the weekend thinking about your feedback Friday, and I'm concerned that 
it widens the scope too far beyond what the current code is meant to do.

ib_usa isn't meant to be a general GSI interface, it's meant to be a user API 
for accessing the existing functionality of the existing ib_sa module. In 
particular, ib_sa and ib_usa provide a mechanism for other processes to share 
SA/SM notices and traps. 

As I mentioned earlier, the reason ib_sa acts as a single access point for 
SA/SM traps and notices is because traps and notices are sent to ports, not to 
queue pairs and not to processes. That means only one entity can be subscribed 
for notices and traps at any particular time, and must manage them, "sharing 
them out" among all processes that are interested in them.

Generalizing that to include other types of notices and traps would involve 
non-trivial changes to the ib_sa and might impact other parts of the infiniband 
subsystem, including the SM, since they would have to be rewritten to deal with 
the possibility that another component is now managing all notices and traps.

Below you will find a proposed API for accessing the notifications 
functionality of the existing ib_sa and ib_usa modules. This is pretty much 
exactly what we are currently using, but since Sean has suggested rdma_cm is 
better suited for multi-casting, they have been omitted.

Now, given that this API is stand-alone right now, it could still be added to 
either libibumad or to libibverbs - but I like Sean's suggestion that it be 
added to verbs, since the current security model restricts libibumad to root 
access and because the existing API already makes use of libibverbs' 
ibv_context data structure.

-- current ib_usa API  

/* InformInfo:TrapNumber */
enum {
        IBV_SA_SM_TRAP_GID_IN_SERVICE              = __constant_cpu_to_be16(64),
        IBV_SA_SM_TRAP_GID_OUT_OF_SERVICE          = __constant_cpu_to_be16(65),
        IBV_SA_SM_TRAP_CREATE_MC_GROUP             = __constant_cpu_to_be16(66),
        IBV_SA_SM_TRAP_DELETE_MC_GROUP             = __constant_cpu_to_be16(67),
        IBV_SA_SM_TRAP_PORT_CHANGE_STATE           = __constant_cpu_to_be16(128),
        IBV_SA_SM_TRAP_LINK_INTEGRITY              = __constant_cpu_to_be16(129),
        IBV_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN    = __constant_cpu_to_be16(130),
        IBV_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED = __constant_cpu_to_be16(131),
        IBV_SA_SM_TRAP_BAD_M_KEY                   = __constant_cpu_to_be16(256),
        IBV_SA_SM_TRAP_BAD_P_KEY                   = __constant_cpu_to_be16(257),
        IBV_SA_SM_TRAP_BAD_Q_KEY                   = __constant_cpu_to_be16(258),
        IBV_SA_SM_TRAP_ALL                         = __constant_cpu_to_be16(0x)
};

struct ibv_sa_event_channel;
struct ibv_sa_event;
struct ibv_sa_id;

/**
 * ibv_sa_create_event_channel - Open a channel used to report events.
 */
struct ibv_sa_event_channel *ibv_sa_create_event_channel();

/**
 * ibv_sa_destroy_event_channel - Close the event channel.
 * @channel: The channel to destroy.
 */
void ibv_sa_destroy_event_channel(struct ibv_sa_event_channel *channel);

/**
 * ibv_sa_get_event - Retrieves the next pending event, if no event is
 *   pending waits for an event.
 * @channel: Event channel to check for events.
 * @event: Allocated information about the next event.
 *Event should be freed using ibv_sa_ack_event()
 */
int ibv_sa_get_event(struct ibv_sa_event_channel *channel,
 struct ibv_sa_event **event);

/**
 * ibv_sa_ack_event - Free an event.
 * @event: Event to be released.
 *
 * All events which are allocated by ibv_sa_get_event() must be released,
 * there should be a one-to-one correspondence between successful gets
 * and acks.
 */
int ibv_sa_ack_event(struct ibv_sa_event *event);

/**
 * ibv_sa_register_inform_info - Registers to receive notice events.
 * @channel: Event channel to issue query on.
 * @device: Device associated with record.
 * @port_num: Port number of record.
 * @trap_number: InformInfo trap number to register for, in network byte
 *   order.
 * @context: User specified context associated with the registration.
 * @id: SA registration identifier.
 *
 * This call initiates a registration request with the SA for the specified
 * t

[ewg] [PATCH] Proposal for MAD Busy handling

2010-10-08 Thread Mike Heinz

Sean, Jason,

I backed off on this because the migration to OFED 1.5.2 and other issues were 
consuming all of my time; I've had this patch for quite a while but I finally 
had time recently to rework and test it for 1.5.2.

The intent of this patch is to try to address the feedback you gave me earlier 
this year. It does NOT implement the ABI/API changes that would be needed in 
user space to take advantage of the new features, but it lays the groundwork 
for doing so. In addition, it provides two new module parameters that allow the 
administrator to coerce existing code into using the new capabilities.

Initially, I had tried to completely separate BUSY retries from timeout 
handling, but that seemed difficult due to the way the timeout code is 
structured. As a result, true timeouts and busy handling still use the same 
timeout values, but I was still able to address the idea of randomizing the 
retry timeout if desired.

By default, the behavior of ib_mad with respect to BUSY responses is unchanged. If, 
however, a send work request is provided that has the new "busy_wait" parameter 
set, ib_mad will ignore BUSY responses to that WR, allowing it to timeout and 
retry as if no response had been received. 

In addition, if the send WR has the new "randomized_wait" parameter set, each 
time the WR times out, the timeout for the next retry is set to 
(send_wr->timeout_ms + 511<<(send_wr->retries) - random32()&511). In other 
words, on the first retry, the randomization code will add between 0 and 1/2 
second to the timeout. On the second, it will add between 1 and 1.5 seconds to 
the timeout, on the 3rd, between 2 and 2.5 seconds, on the 4th, between 4 and 
4.5, et cetera. In addition, a new private field, total_timeout has been added 
to the WR and is initialized to (send_wr->timeout * send_wr->max_retries). 
Retry values are adjusted so that the total # of retry timeouts cannot exceed 
this value.
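
For readers who want to see the retry arithmetic spelled out, here is a user-space sketch of one reading of the formula above, with parentheses added around the shift and the mask. The constants reuse the names from the patch; the function name is made up, the random value is passed in rather than taken from random32(), and the total_timeout cap is not modeled.

```c
#include <stdint.h>

#define MAD_MIN_TIMEOUT_MS  511u   /* growth term, shifted left once per retry */
#define MAD_RAND_TIMEOUT_MS 511u   /* mask applied to the random value */

/* timeout_ms: base timeout of the work request.
 * retries:    number of retries already made.
 * rnd:        any 32-bit random value (random32() in the kernel). */
static uint32_t next_timeout_ms(uint32_t timeout_ms, unsigned retries,
                                uint32_t rnd)
{
    uint32_t grow = MAD_MIN_TIMEOUT_MS << retries;   /* 511, 1022, 2044, ... */
    uint32_t jitter = rnd & MAD_RAND_TIMEOUT_MS;     /* 0..511 */

    return timeout_ms + grow - jitter;
}
```

Under this reading, the extra delay on a given retry falls in a window about half a second wide, and the window starts later on each successive retry.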

Finally, I've added two module parameters that coerce all mad work requests to 
use one or both of these settings:

parm:   treat_busy_as_timeout:When true, treat BUSY responses as if they were timeouts. (int)
parm:   randomized_wait:When true, use a randomized backoff algorithm to control retries for timeouts. (int)

As I mentioned in the past, these changes solve a problem we see in the real 
world all the time (the SM being pounded by "unintelligent" queries) so I 
strongly hope this meets your concerns and we can get it added to the next 
release of OFED.


-

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 64e660c..88ae047 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -41,6 +41,11 @@
 #include "smi.h"
 #include "agent.h"
 
+#include "linux/random.h"
+
+#define MAD_MIN_TIMEOUT_MS 511
+#define MAD_RAND_TIMEOUT_MS 511
+
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_DESCRIPTION("kernel IB MAD API");
 MODULE_AUTHOR("Hal Rosenstock");
@@ -54,6 +59,14 @@ MODULE_PARM_DESC(send_queue_size, "Size of send queue in number of work requests
 module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
 MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work requests");
 
+int mad_wait_on_busy = 0;
+module_param_named(treat_busy_as_timeout, mad_wait_on_busy, int, 0444);
+MODULE_PARM_DESC(treat_busy_as_timeout, "When true, treat BUSY responses as if they were timeouts.");
+
+int mad_randomized_wait = 0;
+module_param_named(randomized_wait, mad_randomized_wait, int, 0444);
+MODULE_PARM_DESC(randomized_wait, "When true, use a randomized backoff algorithm to control retries for timeouts.");
+
 static struct kmem_cache *ib_mad_cache;
 
 static struct list_head ib_mad_port_list;
@@ -1116,11 +1129,19 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
}
 
mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid;
+
+   mad_send_wr->randomized_wait = mad_randomized_wait || send_buf->randomized_wait;
+   mad_send_wr->total_timeout = msecs_to_jiffies(send_buf->timeout_ms) * send_buf->retries;
+   
/* Timeout will be updated after send completes */
mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms);
+
mad_send_wr->max_retries = send_buf->retries;
mad_send_wr->retries_left = send_buf->retries;
+   mad_send_wr->wait_on_busy = send_buf->wait_on_busy || mad_wait_on_busy;
+   
send_buf->retries = 0;
+   
/* Reference for work request to QP + response */
mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0);
mad_send_wr->status = IB_WC_SUCCESS;
@@ -1828,6 +1849,9 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 
/* Complete corresponding request */
if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+   u16 b

Re: [ewg] Binary files in libsdp SRPM in 1.5.2-rc7.

2010-09-21 Thread Mike Heinz

Resending this because I never saw it show up in the list:

Looking at the SRPMS, I noticed that libsdp doesn't seem to have been made from 
clean source. It contains the result of a configure and make operation:

Only in libsdp-1.1.103: config.h
Only in libsdp-1.1.103: config.log
Only in libsdp-1.1.103: config.status
Only in libsdp-1.1.103: libtool
Only in libsdp-1.1.103: Makefile
Only in libsdp-1.1.103/src: config_parser.lo
Only in libsdp-1.1.103/src: config_scanner.lo
Only in libsdp-1.1.103/src: libsdp.la
Only in libsdp-1.1.103/src: linux
Only in libsdp-1.1.103/src: log.lo
Only in libsdp-1.1.103/src: Makefile
Only in libsdp-1.1.103/src: match.lo
Only in libsdp-1.1.103/src: port.lo
Only in libsdp-1.1.103: stamp-h1

Is this going to cause a problem?


[ewg] Binary files in libsdp SRPM in 1.5.2-rc7.

2010-09-20 Thread Mike Heinz

Looking at the SRPMS, I noticed that libsdp doesn't seem to have been made from 
clean source. It contains the result of a configure and make operation:

Only in libsdp-1.1.103: config.h
Only in libsdp-1.1.103: config.log
Only in libsdp-1.1.103: config.status
Only in libsdp-1.1.103: libtool
Only in libsdp-1.1.103: Makefile
Only in libsdp-1.1.103/src: config_parser.lo
Only in libsdp-1.1.103/src: config_scanner.lo
Only in libsdp-1.1.103/src: libsdp.la
Only in libsdp-1.1.103/src: linux
Only in libsdp-1.1.103/src: log.lo
Only in libsdp-1.1.103/src: Makefile
Only in libsdp-1.1.103/src: match.lo
Only in libsdp-1.1.103/src: port.lo
Only in libsdp-1.1.103: stamp-h1

Is this going to cause a problem?


Re: [ewg] [mpich2-dev] Problems with mvapich2-1.5.1 - shared library is missing hwloc_* functions.

2010-09-10 Thread Mike Heinz

BTW - in case it wasn't clear, this is the mvapich2-1.5.1 rpm that comes with 
OFED 1.5.2-rc6.

-Original Message-
From: mpich2-dev-boun...@mcs.anl.gov [mailto:mpich2-dev-boun...@mcs.anl.gov] On 
Behalf Of Mike Heinz
Sent: Friday, September 10, 2010 2:43 PM
To: mpich2-...@mcs.anl.gov; e...@openfabrics.org
Subject: [mpich2-dev] Problems with mvapich2-1.5.1 - shared library is missing 
hwloc_* functions.

Hello all,

I'm trying to build mvapich2-1.5.1 on an RHEL 5 update 3 system. It builds from 
the SRPM just fine, but when I try to compile test programs, they don't link. 
It appears that a set of routines, hwloc_* are missing from the shared library.

[r...@homer bandwidth]# /usr/mpi/gcc/mvapich2-1.5.1/bin/mpicc bw.c
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_obj_by_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_get_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_set_cpubind'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_init'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_cpuset_cpu'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_depth_type'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_type_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_nbobjs_by_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_compare_types'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_destroy'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_load'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_cpuset_alloc'
collect2: ld returned 1 exit status

[ewg] Problems with mvapich2-1.5.1 - shared library is missing hwloc_* functions.

2010-09-10 Thread Mike Heinz

Hello all,

I'm trying to build mvapich2-1.5.1 on an RHEL 5 update 3 system. It builds from 
the SRPM just fine, but when I try to compile test programs, they don't link. 
It appears that a set of routines, hwloc_* are missing from the shared library.

[r...@homer bandwidth]# /usr/mpi/gcc/mvapich2-1.5.1/bin/mpicc bw.c
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_obj_by_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_get_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_set_cpubind'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_init'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_cpuset_cpu'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_depth_type'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_type_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_get_nbobjs_by_depth'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_compare_types'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_destroy'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_topology_load'
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.so: undefined reference to 
`hwloc_cpuset_alloc'
collect2: ld returned 1 exit status

compared with:

[r...@homer bandwidth]# /usr/mpi/gcc/mvapich2-1.5.1/bin/mpicc  bw.c 
/usr/mpi/gcc/mvapich2-1.5.1/lib/libmpich.a
[r...@homer bandwidth]#

[r...@homer lib]# nm libmpich.so | grep hwloc_compare_types
 U hwloc_compare_types

[r...@homer lib]# nm libmpich.a | grep hwloc_compare_types
0020 T hwloc_compare_types

I'm building the binary RPM as follows:

[r...@homer SRPMS]# rpmbuild --rebuild
  --define '_topdir /home/mheinz/work/OFED_SOURCE/buildtemp/OFEDRPMS'
  --buildroot '/home/mheinz/work/OFED_SOURCE/buildtemp/build'
  --define 'build_root /home/mheinz/work/OFED_SOURCE/buildtemp/build'
  --target x86_64 --define '_name mvapich2_gcc'
  --define 'compiler gcc' --define 'impl ofa'
  --define 'open_ib_home /usr' --define '_usr /usr'
  --define 'comp_env CC=gcc CXX=g++ F77=gfortran F90=gfortran'
  --define 'auto_req 0'
  --define 'mpi_selector /usr/bin/mpi-selector'
  --define '_prefix /usr/mpi/gcc/mvapich2-1.5.1'
  --define 'shared_libs 1'
  --define 'romio 1'
  --define 'rdma --with-rdma=gen2'
  --define 'ib_include --with-ib-include=/usr/include'
  --define 'ib_libpath --with-ib-libpath=/usr/lib64'
  --define 'configure_options  --with-psm==no'
  mvapich2-*.src.rpm

Can someone tell me what I'm doing wrong?

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] Problems building OFED 1.5.2 RC2 on RHEL5, SLES11. libsdp fails to configure.

2010-07-22 Thread Mike Heinz

It looks like a problem in the spec file for libsdp. The libsdp error message 
indicates that it needs to be passed a --with-openib option, but install.pl is 
passing in this:

Running  LDFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic 
-L/usr/ofed-1.5.2/lib64 -L/usr/ofed-1.5.2/lib' CFLAGS='-O2 -g -pipe -Wall 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
--param=ssp-buffer-size=4 -m64 -mtune=generic -I/usr/ofed-1.5.2/include' 
CPPFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic 
-I/usr/ofed-1.5.2/include' rpmbuild --rebuild  --define '_topdir 
/var/tmp//OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_prefix 
/usr/ofed-1.5.2' --define '_exec_prefix /usr/ofed-1.5.2' --define '_sysconfdir 
/etc' --define '_usr /usr/ofed-1.5.2' 
/home/mheinz/work/OFED-1.5.2-rc2/SRPMS/libsdp-1.1.101-0.3.gc767eee.src.rpm

-Original Message-
From: ewg-boun...@openfabrics.org [mailto:ewg-boun...@openfabrics.org] On 
Behalf Of Mike Heinz
Sent: Thursday, July 22, 2010 3:05 PM
To: e...@openfabrics.org
Subject: [ewg] Problems building OFED 1.5.2 RC2 on RHEL5, SLES11. libsdp fails 
to configure.

Hey, all - I'm trying to install the 1.5.2-rc2 tarball with the following 
command:

# ./install.pl --all --prefix /usr/ofed-1.5.2-rc2

but it fails when it gets to libsdp:

configure: error: OPENIB: --with-openib must be provided - fail to find 
standard OpenIB kernel installation
error: Bad exit status from /var/tmp/rpm-tmp.53660 (%build)

I'm experiencing this problem with every distro I've tried.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] Problems building OFED 1.5.2 RC2 on RHEL5, SLES11. libsdp fails to configure.

2010-07-22 Thread Mike Heinz

Hey, all - I'm trying to install the 1.5.2-rc2 tarball with the following 
command:

# ./install.pl --all --prefix /usr/ofed-1.5.2-rc2

but it fails when it gets to libsdp:

configure: error: OPENIB: --with-openib must be provided - fail to find 
standard OpenIB kernel installation
error: Bad exit status from /var/tmp/rpm-tmp.53660 (%build)

I'm experiencing this problem with every distro I've tried.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] pkey fix for ipoib - resubmission

2010-06-18 Thread Mike Heinz

I never got a response to this patch, so I'm sending it again.

-

IPoIB is coded to use the 1st PKey in the PKey table as its ib0 interface. 
Additional ib0.pkey interfaces may be created using the /sys/class/... 
add_child interface.

However, there is a race.  During normal boot, IPoIB will be started before the 
port is Active.  Hence the pkey table has not yet been programmed and has a 
default pkey table (with 0x as only pkey).

Later when the SM moves the port to Active, the SM may program the pkey table 
differently.  However at this point IPoIB has already started using the 
incorrect pkey.

It appears that the initially formatted 'broadcast' mgid is never updated 
with the actual pkey value if ipoib comes up before the hca port. The proposed 
patch targets two issues:

1. Suppress activation of the interface and the multicast group join queries 
(they will fail anyway) until the hca port is initialized. When the port becomes 
active, update the pkey value and move on.
2. Update the broadcast mgid based on the actual pkey, then issue the join 
broadcast group request.
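
To make step 2 concrete, here is a minimal user-space sketch of the byte 
manipulation, assuming the 20-byte IPoIB hardware address layout that the 
multicast hunk below relies on (the pkey lands in bytes 8 and 9). The function 
name ipoib_fix_broadcast_pkey and both constant names are made up for 
illustration; they are not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPOIB_HW_ADDR_LEN 20     /* illustrative name: IPoIB hardware address length */
#define PKEY_FULL_MEMBER  0x8000 /* illustrative name: full-membership bit of a P_Key */

/*
 * Write the full-member form of pkey into bytes 8 and 9 of an IPoIB
 * broadcast hardware address, mirroring what the multicast hunk does.
 * Returns the pkey value that was written.
 */
static uint16_t ipoib_fix_broadcast_pkey(uint8_t *broadcast, size_t len,
                                         uint16_t pkey)
{
    assert(broadcast != NULL && len >= IPOIB_HW_ADDR_LEN);

    pkey |= PKEY_FULL_MEMBER;
    broadcast[8] = (uint8_t)(pkey >> 8);   /* high byte of the pkey */
    broadcast[9] = (uint8_t)(pkey & 0xff); /* low byte of the pkey */
    return pkey;
}
```

Only the two pkey bytes are touched, so the rest of the address (and hence the 
rest of the mgid copied from offset 4) stays as it was first formatted.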

Signed-Off-By: Michael Heinz 

---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index ec6b4fb..496d96c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -51,6 +51,7 @@ MODULE_PARM_DESC(data_debug_level,
 #endif
 
 static DEFINE_MUTEX(pkey_mutex);
+static void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 struct ib_pd *pd, struct ib_ah_attr *attr)
@@ -654,12 +655,13 @@ int ipoib_ib_dev_open(struct net_device *dev)
struct ipoib_dev_priv *priv = netdev_priv(dev);
int ret;
 
-   if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
+   ipoib_pkey_dev_check_presence(dev);
+
+   if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
ipoib_warn(priv, "P_Key 0x%04x not found\n", priv->pkey);
clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
return -1;
}
-   set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 
ret = ipoib_init_qp(dev);
if (ret) {
@@ -694,9 +696,26 @@ int ipoib_ib_dev_open(struct net_device *dev)
 static void ipoib_pkey_dev_check_presence(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
-   u16 pkey_index = 0;
+   struct ib_port_attr port_attr;
+
+   if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+   clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+   if (ib_query_port(priv->ca, priv->port, &port_attr)) {
+   ipoib_warn(priv, "Query port attrs failed\n");
+   return;
+   }
+
+   if (port_attr.state != IB_PORT_ACTIVE)
+   return;
+
+   if (ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey)) {
+   ipoib_warn(priv, "Query P_Key table entry 0 failed\n");
+   return;
+   }
+   set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+   }
 
-   if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
+   if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index))
clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
else
set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
@@ -955,7 +974,8 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
}
 
/* restart QP only if P_Key index is changed */
-   if (test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
+   if (test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags) &&
+   test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
new_index == priv->pkey_index) {
ipoib_dbg(priv, "Not flushing - P_Key index not 
changed.\n");
return;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3871ac6..6fe6527 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -552,6 +552,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
}
 
spin_lock_irq(&priv->lock);
+
+   if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+   /* fix broadcast gid in case if pkey was changed */
+   priv->pkey |= 0x8000;
+   priv->dev->broadcast[8] = priv->pkey >> 8;
+   priv->dev->broadcast[9] = priv->pkey & 0xff;
+   }
memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
   sizeof (union ib_gid));
priv->broadcast = broadcast;
___
ewg

[ewg] [PATCH] ofa_kernel openibd script

2010-06-16 Thread Mike Heinz

This patch builds upon my previously submitted patch for improving the default 
handling of the  node_desc.

With this patch, the openibd script will set the description of each HCA in the 
system to the value "@: HCA-##" where "##" is replaced with a unique id number 
for that HCA and the "@" symbol is automatically replaced by the HCA whenever 
the HCA's node description is queried. For example:

r...@bart:~# cat /sys/class/infiniband/mthca0/node_desc
@: HCA-1

[r...@panic ~]# smpquery ND 6
Node Description: bart: HCA-1


Again, this patch is only effective when combined with my previously submitted 
node_desc patch.

Signed-Off-By: Michael Heinz 

-

diff --git a/ofed_scripts/openibd b/ofed_scripts/openibd
index fa65611..447e1a8 100755
--- a/ofed_scripts/openibd
+++ b/ofed_scripts/openibd
@@ -898,8 +898,8 @@ if [ -d \${IBSYSDIR} ]; then
 for hca in \${IBSYSDIR}/*
 do
 if [ -e \${hca}/node_desc ]; then
-logger -i "Set node_desc for \$(basename \$hca): \$(hostname -s) 
HCA-\${hca_id}"
-echo -n "\$(hostname -s) HCA-\${hca_id}" >> \${hca}/node_desc
+logger -i "Set node_desc for HCA-\${hca_id}"
+echo -n "@: HCA-\${hca_id}" >> \${hca}/node_desc
 fi
 let hca_id++
 done
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH v2] OFED 1.5.2 ofa_kernel node_description patch

2010-06-15 Thread Mike Heinz

This is the OFED 1.5.2 version of a patch I submitted earlier today to 
linux-rdma. There are only very small differences between OFED 1.5.2 and 
matching areas of the IB drivers in Linux 2.6.35, but they were enough to break 
the patch, making this version necessary.

If this patch is accepted for 1.5.2, I will also submit the matching patch to 
/etc/init.d/openibd.

Currently, the node description of an HCA is set to a description of the HCA 
hardware or, at boot time, to a brief string containing the hostname of the 
node the HCA is installed in.

The problem is that if the host's DHCP server is slow, the node description may 
be set before the hostname, resulting in an entire fabric of nodes called 
"localhost".

This fix adds a small parsing function to the core infiniband code and a hook 
in each of the HCA drivers so that, at the time the HCA is actually queried for 
its node description, the description is scanned for an '@' character which is 
then replaced with the utsname of the node. This ensures that even if the 
hostname is initially set incorrectly, if it later changes the HCA will report 
the updated information.
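
As a rough illustration of the substitution described above, here is a 
stand-alone user-space sketch. It is not the code in the patch: the host name is 
passed in as a parameter instead of being read from utsname, the copy stops at 
the end of the source string, and the unused tail of the field is zero-filled. 
The names build_node_desc and NODE_DESC_LEN are chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NODE_DESC_LEN 64 /* size of the node description field */

/*
 * Copy src into the fixed-size dest field, replacing each '@' with the
 * short host name (everything before the first '.').  Output is truncated
 * at NODE_DESC_LEN bytes and any unused tail is zero-filled.
 * Returns the number of bytes written before padding.
 */
static size_t build_node_desc(char dest[NODE_DESC_LEN], const char *src,
                              const char *nodename)
{
    size_t i = 0;

    assert(dest != NULL && src != NULL && nodename != NULL);

    while (*src != '\0' && i < NODE_DESC_LEN) {
        if (*src == '@') {
            const char *name = nodename;

            /* expand the marker, never writing past the field */
            while (*name != '\0' && *name != '.' && i < NODE_DESC_LEN)
                dest[i++] = *name++;
        } else {
            dest[i++] = *src;
        }
        src++;
    }
    memset(dest + i, 0, NODE_DESC_LEN - i);
    return i;
}
```

With a source of "@: HCA-1" and a node name of "bart.example.com" (a made-up 
name), this yields "bart: HCA-1". Note that when the expanded text fills all 64 
bytes there is no terminating zero, so callers have to treat the result as a 
fixed-length field rather than a C string.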

In addition, the initialization code for HCA drivers that preset the node_desc 
has been patched to include an '@' character at the beginning of the 
description. This eliminates the need for a special initialization script - 
although existing scripts are still supported.

This updated patch incorporates feedback from Jason Gunthorpe and Or Gerlitz.

Signed-Off-By: Michael Heinz 

---

Testing on Mellanox HCA, case 1 (default):

r...@bart:~# cat /sys/class/infiniband/mthca0/node_desc
@:MT25218 InfiniHostEx Mellanox Technologies

[r...@panic ~]# smpquery ND 6
Node Description: bart:MT25218 InfiniHostEx Mellanox Technologies


Testing on Mellanox HCA, case 2 - over 64 characters long:

r...@bart:~# echo 
"0123456789112345678921234567...@234567894123456789512345678961234567897" 
>/sys/class/infiniband/mthca0/node_desc
r...@bart:~# cat /sys/class/infiniband/mthca0/node_desc
0123456789112345678921234567...@23456789412345678951234567896123

[r...@panic sbin]# smpquery ND 6
Node Description:.0123456789112345678921234567893bart2345678941234567895123456789


Testing on Mellanox HCA, case 3 - short:

r...@bart:~# echo "@" >/sys/class/infiniband/mthca0/node_desc

[r...@panic sbin]# smpquery ND 6
Node Description:...bart

--

Testing with QIB HCA:

[r...@node-b2 ~]# cat /sys/class/infiniband/qib0/node_desc
@:QLogic kernel.org driver

[r...@node-a1 ~]# smpquery ND 0x140
Node Description:.node-b2:QLogic kernel.org driver


[r...@node-b2 1]# cat /sys/class/infiniband/qib0/node_desc
@

[r...@node-a1 ~]# smpquery ND 0x140
Node Description:.node-b2


[r...@node-b2 ~]# echo 
"0123456789112345678921234567...@234567894123456789512345678961234567897" 
>/sys/class/infiniband/qib0/node_desc

[r...@node-a1 ~]# smpquery ND 0x140
Node Description:.0123456789112345678921234567893node-b22345678941234567895123456


---

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef1304f..bdf1cfa 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -41,6 +41,7 @@
 #include "mad_rmpp.h"
 #include "smi.h"
 #include "agent.h"
+#include "linux/utsname.h"
 
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_DESCRIPTION("kernel IB MAD API");
@@ -932,6 +933,29 @@ int ib_get_mad_data_offset(u8 mgmt_class)
 }
 EXPORT_SYMBOL(ib_get_mad_data_offset);
 
+#define NODE_DESC_FIELD_LENGTH 64
+void ib_build_node_desc(char *dest, char *src)
+{
+   int i;
+   for (i=0; inodename;
+   for (; *name && *name != '.' && 
iattr_mod)
smp->status |= IB_SMP_INVALID_FIELD;
 
-   strncpy(smp->data, ibdev->node_desc, sizeof(smp->data));
+   ib_build_node_desc((char*)smp->data, ibdev->node_desc);
 
return reply(smp);
 }
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c 
b/drivers/infiniband/hw/ipath/ipath_verbs.c
index dd7f26d..db8b719 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -2180,7 +2180,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
dev->dma_ops = &ipath_dma_mapping_ops;
 
snprintf(dev->node_desc, sizeof(dev->node_desc),
-IPATH_IDSTR " %s", init_utsname()->nodename);
+"@:" IPATH_IDSTR);
 
ret = ib_register_device(dev, NULL);
if (ret)
diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index f38d5b1..d83398f 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -196,7 +196,7 @@ static void node_desc_override(struct ib_device *dev,
mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP &&
mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) {
spin_lock(&to_mdev(dev)->sm_lock);
-   memcpy(((struct ib_smp *) mad)->data, dev-

Re: [ewg] [PATCH] ofa_kernel madeye.c

2010-06-14 Thread Mike Heinz

Thanks!

From: Vladimir Sokolovsky [v...@dev.mellanox.co.il]
Sent: Sunday, June 13, 2010 5:01 AM
To: Mike Heinz
Cc: e...@openfabrics.org
Subject: Re: [ewg] [PATCH] ofa_kernel madeye.c

Mike Heinz wrote:
> This is a simple fix. Several of the snoop filters in 
> ./drivers/infiniband/util/madeye.c don't switch the attribute id to host byte 
> order before checking it.
>
> Signed-off-by: Michael Heinz 
>

Applied,

Regards,
Vladimir
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] node description patch

2010-06-11 Thread Mike Heinz

Jack - for some reason your reply dropped into my junk email box instead of my 
EWG folder.

I hate Outlook.

Anyway - It looks like NODE_DESC_HOSTNAME is set in /etc/infiniband/openib.conf 
but, you're right, there should be a fallback in the script. I'll make the 
change and resubmit. I'll also add code to force a trailing zero.

-Original Message-
From: Jack Morgenstein [mailto:ja...@dev.mellanox.co.il] 
Sent: Thursday, June 03, 2010 6:14 AM
To: e...@openfabrics.org
Cc: Mike Heinz; e...@openfabrics.org
Subject: Re: [ewg] [PATCH] node description patch

On Tuesday 01 June 2010 17:04, Mike Heinz wrote:
> +            logger -i "Set node_desc for ${hca}: ${NODE_DESC_HOSTNAME} 
> HCA-\${hca_id}"
> +            echo -n "${NODE_DESC_HOSTNAME} HCA-${hca_id}" >> 
> ${sysdir}/${hca}/node_desc
> 

I don't see NODE_DESC_HOSTNAME defined anywhere.

Don't you need a line like:
> +echo -n "${NODE_DESC_HOSTNAME:-...@} HCA-${hca_id}" >> 
> ${sysdir}/${hca}/node_desc

So that you will have the "@" that the driver looks for if NODE_DESC_HOSTNAME 
is not set
in the environment, or will use some other string if the user
wishes to override by setting NODE_DESC_HOSTNAME to something else in the 
environment?

Also, I am concerned about the trailing \0 character. It looks to me that your 
function
ib_build_node_desc does not deal with this.  For example, if the node name is 
very long,
and the 64 bytes does not reach the end of the string stored in sysfs due to 
the length
difference when replacing "@" with the node name.

-Jack

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH] ofa_kernel madeye.c

2010-06-11 Thread Mike Heinz

I've submitted this patch to EWG twice now, without acknowledgement. This is 
the third time - there is a bug in the madeye module included in OFED 1.5.2 
that breaks the filter functions on x86 processors because the headers are in 
network, not host, byte order.

This patch fixes the bug and should be included in OFED 1.5.2.
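
For anyone who wants to see the byte-order issue outside the kernel, here is a 
small user-space sketch of the comparison the filters need to make. It uses the 
POSIX ntohs() as a stand-in for the kernel's be16_to_cpu(); the function name 
attr_id_matches is invented for this example.

```c
#include <arpa/inet.h> /* ntohs, htons (POSIX) */
#include <stdint.h>

/*
 * Return nonzero if a MAD should pass the filter.  wire_attr_id is the
 * attribute id exactly as it sits in the MAD header (network byte order);
 * filter_attr_id is the host-order value supplied by the user, where 0
 * means "match everything".
 */
static int attr_id_matches(uint16_t wire_attr_id, uint16_t filter_attr_id)
{
    if (filter_attr_id == 0)
        return 1;
    /* convert to host order before comparing */
    return ntohs(wire_attr_id) == filter_attr_id;
}
```

Comparing the two values directly only works on big-endian hosts, which is why 
the unpatched filters fail on x86.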

-Original Message-
From: ewg-boun...@openfabrics.org [mailto:ewg-boun...@openfabrics.org] On 
Behalf Of Mike Heinz
Sent: Tuesday, June 01, 2010 9:58 AM
To: e...@openfabrics.org
Subject: [ewg] [PATCH] ofa_kernel madeye.c

I'm resending this, because it seems to have been overlooked.

The linux-rdma group does not feel madeye should be added to the upstream 
kernel, but there are still bugs in the version of madeye that we include in 
OFED. This patch should be applied to the OFED version of madeye.c.

-Original Message-
From: ewg-boun...@openfabrics.org [mailto:ewg-boun...@openfabrics.org] On 
Behalf Of Mike Heinz
Sent: Wednesday, May 26, 2010 4:01 PM
To: e...@openfabrics.org
Subject: [ewg] [PATCH] ofa_kernel madeye.c

This is a simple fix. Several of the snoop filters in 
./drivers/infiniband/util/madeye.c don't switch the attribute id to host byte 
order before checking it. 

Signed-off-by: Michael Heinz 

diff --git a/drivers/infiniband/util/madeye.c b/drivers/infiniband/util/madeye.c
index 0cda06c..2c650a3 100644
--- a/drivers/infiniband/util/madeye.c
+++ b/drivers/infiniband/util/madeye.c
@@ -401,7 +401,7 @@ static void snoop_smi_handler(struct ib_mad_agent 
*mad_agent,
 
if (!smp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent SMP\n");
@@ -413,7 +413,7 @@ static void recv_smi_handler(struct ib_mad_agent *mad_agent,
 {
if (!smp && mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) 
!= attr_id)
return;
 
printk("Madeye:recv SMP\n");
@@ -446,7 +446,7 @@ static void snoop_gsi_handler(struct ib_mad_agent 
*mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent GMP\n");
@@ -468,7 +468,7 @@ static void recv_gsi_handler(struct ib_mad_agent *mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) 
!= attr_id)
return;
 
printk("Madeye:recv GMP\n");
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] Handling busy responses from the SA

2010-06-08 Thread Mike Heinz

It's workable, although I really wish there was a way to handle "stupid" apps 
that aren't written to handle a busy response.

-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Tuesday, June 08, 2010 12:44 PM
To: Jason Gunthorpe
Cc: Mike Heinz; linux-r...@vger.kernel.org; e...@openfabrics.org
Subject: RE: [PATCH] Handling busy responses from the SA

> Also, I guess, it would be a good API choice if the caller could say
> 'get me a reply for this mad or error within 60s' rather than specify
> details like retry counts, etc. The timeout values should be globally
> set and derived from the usual SA provided data for network transits...

I agree with this.  Within the framework of the existing umad ABI, this could 
be specified by setting the high bit in the ib_user_mad_hdr:timeout_ms field, 
assuming that no one is using that bit in practice.  The kernel could then 
freely select the retry/timeout policy for these clients, which for starters 
could include dropping BUSY responses and adjusting the timeout using an 
approach similar to what Mike mentioned in a separate email.  Kernel clients 
could be updated to use this new mode.
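
A rough sketch of how such a flag could be encoded and decoded, assuming a 
32-bit timeout_ms field. The macro and function names here are hypothetical; 
nothing like them exists in the umad ABI today.

```c
#include <stdint.h>

/* Hypothetical flag: high bit of timeout_ms asks the kernel to pick the policy. */
#define MAD_TIMEOUT_KERNEL_POLICY 0x80000000u

/* Nonzero if the client asked the kernel to choose retries and timeouts. */
static int timeout_uses_kernel_policy(uint32_t timeout_ms)
{
    return (timeout_ms & MAD_TIMEOUT_KERNEL_POLICY) != 0;
}

/* The overall deadline in milliseconds, with the flag bit masked off. */
static uint32_t timeout_deadline_ms(uint32_t timeout_ms)
{
    return timeout_ms & ~MAD_TIMEOUT_KERNEL_POLICY;
}
```

A client wanting "a reply or an error within 60s" would then pass 
60000 | MAD_TIMEOUT_KERNEL_POLICY, and existing clients that never set the high 
bit would see no change in behavior.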

Any disagreements to this approach?  
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz

> But, I also agree with Roland.. having the SA return busy when it is
> under load seems insane :) 

In that case, what is the purpose of the BUSY response? 

-Original Message-
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] 
Sent: Friday, June 04, 2010 6:58 PM
To: Hefty, Sean
Cc: Mike Heinz; linux-r...@vger.kernel.org; e...@openfabrics.org
Subject: Re: [PATCH] Handling busy responses from the SA

On Fri, Jun 04, 2010 at 02:05:10PM -0700, Hefty, Sean wrote:

> Maybe we should re-think that guideline and allow users to simply
> indicate that the MAD layer should use reasonable defaults.  This
> would enable the ib_mad module to adjust the timeout values for all
> consumers based on actual destination response times.  It could also
> back off retrying multiple requests that were initiated around the
> same time, instead only retrying the first request, while simply
> increasing the timeout values for the others.  This is more complex,
> but we should be able to start with something fairly simple.

A common method for handling this sort of thing is to randomize
the retry timeout. It would be a good idea to randomize all timeouts,
but the BUSY replies should probably randomize over a longer time
period.

Randomization prevents nodes in the cluster from self-synchronizing
and making the load on the SA worse.
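
A minimal sketch of the idea, with the random number passed in by the caller so 
the arithmetic is easy to check. The 4000 ms base and the two spread values are 
placeholders picked for this example, not numbers taken from the MAD layer, and 
both function names are invented.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a retry delay in [base_ms, base_ms + spread_ms) from a
 * caller-supplied random number.
 */
static uint32_t jittered_delay_ms(uint32_t base_ms, uint32_t spread_ms,
                                  uint32_t rnd)
{
    assert(spread_ms > 0);
    return base_ms + rnd % spread_ms;
}

/* Randomize over a longer period after a BUSY reply than after a timeout. */
static uint32_t retry_delay_ms(int was_busy, uint32_t rnd)
{
    const uint32_t base_ms = 4000;                     /* placeholder */
    const uint32_t spread_ms = was_busy ? 8000 : 1000; /* placeholders */

    return jittered_delay_ms(base_ms, spread_ms, rnd);
}
```

Because each node draws its own random value, retries from many nodes spread 
out over the window instead of arriving at the SA together.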

But, I also agree with Roland.. having the SA return busy when it is
under load seems insane :) But if you really want to do this then I
think a different, larger, timeout should be used than the standard
mad timeout.

Also, I guess, it would be a good API choice if the caller could say
'get me a reply for this mad or error within 60s' rather than specify
details like retry counts, etc. The timeout values should be globally
set and derived from the usual SA provided data for network transits...

Jason
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz

Sean said:
> I don't object to the concept of treating a busy response as a timeout, but 
> how does this help prevent overwhelming the SA?  It continues to retry the 
> queries, even if the SA says that it's too busy to respond without adjusting 
> the timeout specified by the user.  I would think that you'd at least want to 
> adjust the timeout (double it or use some random backoff).


Well, the current behavior is to simply return the BUSY to the client or ULP, 
which  is either treated as a permanent error or causes an immediate retry. 
This can be a big problem with, for example, ipoib which sets retries to 15 and 
(as I understand it) immediately retries to connect when getting an error 
response from the SA. Other ulps have similar settings. Without some kind of 
delay, starting up ipoib on a large fabric (at boot time, for example) can 
cause a real packet storm. 

By treating BUSY replies identically to timeouts, this patch at least 
introduces a delay between attempts. In the case of the ULPs, the delay is 
typically 4 seconds.

Sean said:
> The general guideline that we've been using for adjusting timeouts has been 
> to report the failures and let the caller make the necessary adjustments.  
> As far as I know, the only way for user space applications to query the SA 
> are through the librdmacm, which sets retries to 0, or through the libibumad 
> interface directly.  I would expect any application using the latter to be 
> intelligent enough to handle a busy response.


And this approach encourages applications to adjust their timeouts 
appropriately by treating BUSY responses as non-events and forcing the 
applications to wait for their request to time out.

Depending on the application developers to take BUSY responses into account 
seems to be asking for trouble - it allows one rogue app to bring the SA to its 
knees, for example. By enforcing this timeout model in the kernel, we guarantee 
that there will be at least some delay between each message when the SA is 
reporting a busy status. And as I previously mentioned this patch also affects 
kernel code, much of which does use retries.

Sean said:
> Maybe we should re-think that guideline and allow users to simply indicate 
> that the MAD layer should use reasonable defaults.  This would enable the 
> ib_mad module to adjust the timeout values for all consumers based on actual 
> destination response times.  It could also back off retrying multiple 
> requests that were initiated around the same time, instead only retrying the 
> first request, while simply increasing the timeout values for the others.  
> This is more complex, but we should be able to start with something fairly 
> simple.

It's an interesting idea, but in the meantime this is a problem that affects 
large clusters today.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] Handling busy responses from the SA

2010-06-07 Thread Mike Heinz

Roland Dreier said:

> I don't have a strong opinion on this but it seems a bit odd.  If we're just 
> going to drop the response anyway, why did the SA send it in the first place? 
>  On the other hand, if the SA told us it's busy, it does seem we could do 
> something more sensible than retrying immediately.

The spec provides for the SA to return a BUSY response. When that happens, this 
patch causes us to wait for the original request to time out before retrying, 
not trying again immediately. In effect, we are pretending we never got the 
BUSY response and allowing the request to time out, instead.

Roland Dreier said:

> The indentation of values seems pretty crazy here.  Also I'm not sure what 
> most of these defines are for?  They seem unused in this patch.

The indentation is probably from the conversion of tabs to spaces when the 
patch was pasted into the email - correcting it is no problem.  The value 
IB_MGMT_MAD_STATUS_BUSY is used in the patch, the others are defined because 
they are the other possible values for the same status field. We might as well 
define them all, for completeness.

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH] Handling busy responses from the SA

2010-06-04 Thread Mike Heinz

The purpose of this patch is to cause the ib_mad driver to discard busy 
responses from the SA, effectively causing busy responses to become time outs.

This ensures that naïve IB applications cannot overwhelm the SA with queries, 
which could happen when a cluster is being rebooted, or when a large HPC 
application is started.
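
The decision the patch makes can be summarized in a few lines of user-space C. 
This is only a sketch of the logic: ntohs() stands in for the kernel's 
__be16_to_cpu(), and the enum and function names are invented for illustration.

```c
#include <arpa/inet.h> /* ntohs, htons (POSIX) */
#include <assert.h>
#include <stdint.h>

#define MAD_STATUS_BUSY 0x0001 /* busy bit of the MAD status field */

enum mad_action {
    MAD_COMPLETE,     /* hand the response to the waiting request */
    MAD_DROP_AND_WAIT /* discard it and let the request time out and retry */
};

/*
 * wire_status is the status field as received (network byte order);
 * retries_left is the retry budget remaining on the matching request.
 */
static enum mad_action classify_response(uint16_t wire_status, int retries_left)
{
    assert(retries_left >= 0);

    if ((ntohs(wire_status) & MAD_STATUS_BUSY) && retries_left > 0)
        return MAD_DROP_AND_WAIT;
    return MAD_COMPLETE;
}
```

Once the retry budget is exhausted the busy response is delivered as usual, so 
the caller still sees the status on the final attempt.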

Note that this patch directly changes the same code affected by the mad user 
rmpp patch - it cannot be successfully applied without that patch.

Signed-Off-By: Michael Heinz 



diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c 
index efca783..05f2930 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1815,9 +1815,20 @@ static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
 */
/* Complete corresponding request */
if (ib_response_mad(mad_recv_wc->recv_buf.mad)) {
+   u16 busy = __be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.status) &
+   IB_MGMT_MAD_STATUS_BUSY;
+
spin_lock_irqsave(&mad_agent_priv->lock, flags);
mad_send_wr = ib_find_send_mad(mad_agent_priv, mad_recv_wc);
if (mad_send_wr) {
+   if (busy && mad_send_wr->retries_left) {
+   /* Just let the query timeout and have it requeued later */
+   spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+   ib_free_recv_mad(mad_recv_wc);
+   deref_mad_agent(mad_agent_priv);
+   printk(KERN_NOTICE PFX "Response returned with MAD_STATUS_BUSY\n");
+   return;
+   }
ib_mark_mad_done(mad_send_wr);
spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 2651e93..e9dc4cc 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -77,6 +77,15 @@
 
 #define IB_MGMT_MAX_METHODS 128
 
+/* MAD Status field bit masks */
+#define IB_MGMT_MAD_STATUS_SUCCESS                     0x
+#define IB_MGMT_MAD_STATUS_BUSY                        0x0001
+#define IB_MGMT_MAD_STATUS_REDIRECT_REQD               0x0002
+#define IB_MGMT_MAD_STATUS_BAD_VERSION                 0x0004
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD          0x0008
+#define IB_MGMT_MAD_STATUS_UNSUPPORTED_METHOD_ATTRIB   0x000c
+#define IB_MGMT_MAD_STATUS_INVALID_ATTRIB_VALUE        0x001c
+
 /* RMPP information */
 #define IB_MGMT_RMPP_VERSION   1
 #define IB_MGMT_RMPP_PASSTHRU  255
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH] node description patch

2010-06-01 Thread Mike Heinz

This patch fixes a problem with the openibd initialization script. On machines 
using slower DHCP servers, openibd frequently sets the HCA's node description 
to HCA-1. This patch modifies openibd to add a "@" instead of the hostname and 
adds a small hook in the core drivers to replace the "@" sign with the system's 
utsname().

Because this patch depends on changes to openibd, it cannot be submitted to the 
upstream kernel, but it still corrects an outstanding issue with OFED 1.5.

Signed-Off-By: Michael Heinz 

---
 drivers/infiniband/core/mad.c   |   18 ++
 drivers/infiniband/hw/ipath/ipath_mad.c |2 -
 drivers/infiniband/hw/mlx4/mad.c|2 -
 drivers/infiniband/hw/mthca/mthca_mad.c |2 -
 include/rdma/ib_mad.h   |8 ++
 ofed_scripts/openibd|   41 ++--
 6 files changed, 47 insertions(+), 26 deletions(-)

Index: ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/core/mad.c
===
--- ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2.orig/drivers/infiniband/core/mad.c
+++ ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/core/mad.c
@@ -39,6 +39,7 @@
 #include "mad_rmpp.h"
 #include "smi.h"
 #include "agent.h"
+#include "linux/utsname.h"
 
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_DESCRIPTION("kernel IB MAD API");
@@ -929,6 +930,23 @@ int ib_get_mad_data_offset(u8 mgmt_class
 }
 EXPORT_SYMBOL(ib_get_mad_data_offset);
 
+void ib_build_node_desc(char *dest, char *src)
+{
+   int i;
+   for (i=0; i<64;) {
+   if (*src == '@') {
+   char *name = init_utsname()->nodename;
+   for (; *name && *name != '.' && i<64; ++i)
+   *dest++ = *name++;
+   src++;
+   } else {
+   *dest++ = *src++;
+   i++;
+   }
+   }
+}
+EXPORT_SYMBOL(ib_build_node_desc);
+
 int ib_is_mad_class_rmpp(u8 mgmt_class)
 {
if ((mgmt_class == IB_MGMT_CLASS_SUBN_ADM) ||
Index: 
ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/ipath/ipath_mad.c
===
--- 
ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2.orig/drivers/infiniband/hw/ipath/ipath_mad.c
+++ 
ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/ipath/ipath_mad.c
@@ -60,7 +60,7 @@ static int recv_subn_get_nodedescription
if (smp->attr_mod)
smp->status |= IB_SMP_INVALID_FIELD;
 
-   strncpy(smp->data, ibdev->node_desc, sizeof(smp->data));
+   ib_build_node_desc((char*)smp->data, ibdev->node_desc);
 
return reply(smp);
 }
Index: ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/mlx4/mad.c
===
--- ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2.orig/drivers/infiniband/hw/mlx4/mad.c
+++ ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/mlx4/mad.c
@@ -195,7 +195,7 @@ static void node_desc_override(struct ib
mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP &&
mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) {
spin_lock(&to_mdev(dev)->sm_lock);
-   memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64);
+   ib_build_node_desc((char*)((struct ib_smp *) mad)->data, dev->node_desc);
spin_unlock(&to_mdev(dev)->sm_lock);
}
 }
Index: ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/mthca/mthca_mad.c
===
--- ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2.orig/drivers/infiniband/hw/mthca/mthca_mad.c
+++ ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/drivers/infiniband/hw/mthca/mthca_mad.c
@@ -153,7 +153,7 @@ static void node_desc_override(struct ib
mad->mad_hdr.method == IB_MGMT_METHOD_GET_RESP &&
mad->mad_hdr.attr_id == IB_SMP_ATTR_NODE_DESC) {
mutex_lock(&to_mdev(dev)->cap_mask_mutex);
-   memcpy(((struct ib_smp *) mad)->data, dev->node_desc, 64);
+   ib_build_node_desc((char*)((struct ib_smp *) mad)->data, dev->node_desc);
mutex_unlock(&to_mdev(dev)->cap_mask_mutex);
}
 }
Index: ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/include/rdma/ib_mad.h
===
--- ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2.orig/include/rdma/ib_mad.h
+++ ofa_kernel-1.4.1_QIB-2.6.16_sles10_sp2/include/rdma/ib_mad.h
@@ -637,6 +637,14 @@ int ib_is_mad_class_rmpp(u8 mgmt_class);
 int ib_get_mad_data_offset(u8 mgmt_class);
 
 /**
+ * ib_build_node_desc - copies the node description and replaces
+ * any @ markers with the present system node name.
+ * @dest: destination
+ * @src: source
+ */
+void ib_build_node_desc(char *dest, char *src);
+
+/**
  * ib_get_rmpp_segment - returns the data buffer for a given RMPP s
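For readers who want to try the '@' substitution outside the kernel tree, here is a minimal user-space sketch of the same idea. It is not the patch's code: `build_node_desc_sketch` and `NODE_DESC_LEN` are hypothetical names, the host name is passed in as a parameter rather than read from `init_utsname()`, and, unlike the patch, the sketch stops at the end of the source string and zero-fills the rest of the 64-byte field.

```c
#include <stddef.h>
#include <string.h>

#define NODE_DESC_LEN 64  /* size of the NodeDescription field used in the patch */

/*
 * Hypothetical user-space variant of the '@' expansion.
 * Copies src into a fixed 64-byte field, replacing each '@' with the
 * host name up to (not including) its first '.'.
 * Assumption: src is NUL-terminated; the result is zero-padded and is
 * NOT NUL-terminated when all 64 bytes are used.
 */
static void build_node_desc_sketch(char dest[NODE_DESC_LEN],
                                   const char *src,
                                   const char *nodename)
{
    size_t i = 0;

    memset(dest, 0, NODE_DESC_LEN);
    while (i < NODE_DESC_LEN && *src != '\0') {
        if (*src == '@') {
            const char *name = nodename;

            /* Copy the short host name, never writing past the field. */
            while (*name != '\0' && *name != '.' && i < NODE_DESC_LEN)
                dest[i++] = *name++;
        } else {
            dest[i++] = *src;
        }
        src++;
    }
}
```

With a description template of "@ HCA-1" and a host name of "st2139.example.com" (an illustrative name), the field would start with "st2139 HCA-1".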

[ewg] [PATCH] ofa_kernel madeye.c

2010-06-01 Thread Mike Heinz

I'm resending this, because it seems to have been overlooked.

The linux-rdma group does not feel madeye should be added to the upstream 
kernel, but there are still bugs in the version of madeye that we include in 
OFED. This patch should be applied to the OFED version of madeye.c.

-Original Message-
From: ewg-boun...@openfabrics.org [mailto:ewg-boun...@openfabrics.org] On 
Behalf Of Mike Heinz
Sent: Wednesday, May 26, 2010 4:01 PM
To: e...@openfabrics.org
Subject: [ewg] [PATCH] ofa_kernel madeye.c

This is a simple fix. Several of the snoop filters in 
./drivers/infiniband/util/madeye.c don't switch the attribute id to host byte 
order before checking it. 

Signed-off-by: Michael Heinz 

diff --git a/drivers/infiniband/util/madeye.c b/drivers/infiniband/util/madeye.c
index 0cda06c..2c650a3 100644
--- a/drivers/infiniband/util/madeye.c
+++ b/drivers/infiniband/util/madeye.c
@@ -401,7 +401,7 @@ static void snoop_smi_handler(struct ib_mad_agent *mad_agent,
 
if (!smp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent SMP\n");
@@ -413,7 +413,7 @@ static void recv_smi_handler(struct ib_mad_agent *mad_agent,
 {
if (!smp && mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) != attr_id)
return;
 
printk("Madeye:recv SMP\n");
@@ -446,7 +446,7 @@ static void snoop_gsi_handler(struct ib_mad_agent *mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent GMP\n");
@@ -468,7 +468,7 @@ static void recv_gsi_handler(struct ib_mad_agent *mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) != attr_id)
return;
 
printk("Madeye:recv GMP\n");
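The byte-order issue this patch fixes can be illustrated in user space. The sketch below is only an analogy under stated assumptions: `mad_hdr_sketch` and `attr_matches` are hypothetical names, and `ntohs()`/`htons()` stand in for the kernel's `be16_to_cpu()`/`cpu_to_be16()`. The attribute ID travels in network (big-endian) order, while the filter value is in host order, so one side has to be converted before the comparison.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the MAD header. */
struct mad_hdr_sketch {
    uint8_t  mgmt_class;
    uint16_t attr_id;     /* stored in network (big-endian) byte order */
};

/*
 * Return 1 when the header passes the attribute filter.
 * filter_attr_id is in host byte order; 0 means "no filter",
 * mirroring the "attr_id &&" test in the patch.
 */
static int attr_matches(const struct mad_hdr_sketch *hdr,
                        uint16_t filter_attr_id)
{
    if (filter_attr_id == 0)
        return 1;
    /* Convert the wire value to host order before comparing. */
    return ntohs(hdr->attr_id) == filter_attr_id;
}
```

On a little-endian host, comparing `hdr->attr_id` directly against a host-order filter would only match values whose two bytes happen to be identical, which is why the unpatched filters appear to ignore most attribute IDs.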
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] Question: When should patches be submitted to EWG and when should they be submitted to linux-rdma?

2010-05-27 Thread Mike Heinz

Thanks to everyone for the feedback and the explanation. I'm going to save off 
Woody's answer to our in house wiki so my co-workers won't have the same 
confusion.

-Original Message-
From: Woodruff, Robert J [mailto:robert.j.woodr...@intel.com] 
Sent: Wednesday, May 26, 2010 6:58 PM
To: Mike Heinz; openfabrics-...@openib.org
Subject: RE: [ewg] Question: When should patches be submitted to EWG and when 
should they be submitted to linux-rdma?

In general, we would like kernel code to be reviewed and accepted (or at least 
queued for acceptance) upstream first and then submitted to the ewg for the 
next OFED release.

There are sometimes exceptions where things go into OFED before being accepted
upstream but in general, we would like to follow the model where they are 
submitted upstream first if possible.

Some things, like backport patches or OFED installation scripts, 
are only maintained by the EWG, so in those cases, they only need to 
be submitted to the EWG list.

Hope this helps.

woody

-Original Message-
From: ewg-boun...@lists.openfabrics.org 
[mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Mike Heinz
Sent: Wednesday, May 26, 2010 1:34 PM
To: openfabrics-...@openib.org
Subject: [ewg] Question: When should patches be submitted to EWG and when 
should they be submitted to linux-rdma?

The subject says it all. If I have a patch that can be applied against either 
the current OFED git repository or against the upstream kernel - where do I 
post it?
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] Question: When should patches be submitted to EWG and when should they be submitted to linux-rdma?

2010-05-26 Thread Mike Heinz

My preference for bug fixes is that they be applied so that they go into the 
upstream kernel - assuming they don't require EWG-only changes. But I need to 
understand the correlation between the two source trees - if you accept a bug 
fix for the upstream kernel, will that end up in OFED as well, or do I need to 
submit the patch to both groups? 

-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Wednesday, May 26, 2010 4:50 PM
To: Mike Heinz
Cc: openfabrics-...@openib.org
Subject: Re: [ewg] Question: When should patches be submitted to EWG and when 
should they be submitted to linux-rdma?

 > The subject says it all. If I have a patch that can be applied
 > against either the current OFED git repository or against the
 > upstream kernel - where do I post it?

What do you want to happen to the patch?  If you want it applied to the
upstream kernel, then send it to me and linux-rdma.  If you want it
applied to an OFED tree, send it to ewg.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] Question: When should patches be submitted to EWG and when should they be submitted to linux-rdma?

2010-05-26 Thread Mike Heinz

The subject says it all. If I have a patch that can be applied against either 
the current OFED git repository or against the upstream kernel - where do I 
post it?
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH] ofa_kernel madeye.c

2010-05-26 Thread Mike Heinz

This is a simple fix. Several of the snoop filters in 
./drivers/infiniband/util/madeye.c don't switch the attribute id to host byte 
order before checking it. 

Signed-off-by: Michael Heinz 

diff --git a/drivers/infiniband/util/madeye.c b/drivers/infiniband/util/madeye.c
index 0cda06c..2c650a3 100644
--- a/drivers/infiniband/util/madeye.c
+++ b/drivers/infiniband/util/madeye.c
@@ -401,7 +401,7 @@ static void snoop_smi_handler(struct ib_mad_agent *mad_agent,
 
if (!smp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent SMP\n");
@@ -413,7 +413,7 @@ static void recv_smi_handler(struct ib_mad_agent *mad_agent,
 {
if (!smp && mad_recv_wc->recv_buf.mad->mad_hdr.mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) != attr_id)
return;
 
printk("Madeye:recv SMP\n");
@@ -446,7 +446,7 @@ static void snoop_gsi_handler(struct ib_mad_agent *mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && hdr->attr_id != attr_id)
+   if (attr_id && be16_to_cpu(hdr->attr_id) != attr_id)
return;
 
printk("Madeye:sent GMP\n");
@@ -468,7 +468,7 @@ static void recv_gsi_handler(struct ib_mad_agent *mad_agent,
 
if (!gmp && hdr->mgmt_class != mgmt_class)
return;
-   if (attr_id && mad_recv_wc->recv_buf.mad->mad_hdr.attr_id != attr_id)
+   if (attr_id && be16_to_cpu(mad_recv_wc->recv_buf.mad->mad_hdr.attr_id) != attr_id)
return;
 
printk("Madeye:recv GMP\n");
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] management: adding mad_dump_fields to libibmad

2010-05-06 Thread Mike Heinz

Sasha, thanks for sending me that. 

Despite asking several times over the past couple of years, you're the first 
person to actually point me to a document on how to submit patches to the group.

I will be sure to adhere to that format in the future.

-Original Message-
From: Sasha Khapyorsky [mailto:sashakv...@gmail.com] On Behalf Of Sasha 
Khapyorsky
Sent: Thursday, May 06, 2010 5:03 PM
To: Mike Heinz
Cc: linux-r...@vger.kernel.org; e...@openfabrics.org
Subject: Re: [PATCH] management: adding mad_dump_fields to libibmad

On 13:27 Thu 06 May     , Mike Heinz wrote:
> Sasha asked that I re-submit the patches for perfquery in a slightly 
> different format. This is the first of 3 patches.

I just asked you to try to follow the normal patch submission format
described in detail here:

http://git.kernel.org/?p=linux/kernel/git/roland/infiniband.git;a=blob;f=Documentation/SubmittingPatches

So each patch will have its own subject line, commit message, etc.

> This patch adds a function to libibmad that allows the caller to dump a 
> configurable range of MAD attributes. Basically, this provides an external 
> interface to the internal function _dump_fields.
> 
> Signed Off: Michael Heinz

All three applied. Thanks.

Sasha
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] management: adding mad_dump_fields to libibmad

2010-05-06 Thread Mike Heinz

Sasha asked that I re-submit the patches for perfquery in a slightly different 
format. This is the third of 3 patches.

This patch corrects the AllPortSelect error message that is generated by 
ibcheckerrors when used against switches that do not support that attribute.

Signed-off-by: Michael Heinz

- snip ---
diff --git a/infiniband-diags/scripts/ibcheckerrs.in b/infiniband-diags/scripts/ibcheckerrs.in
index 305379a..15bfd4a 100644
--- a/infiniband-diags/scripts/ibcheckerrs.in
+++ b/infiniband-diags/scripts/ibcheckerrs.in
@@ -155,6 +155,14 @@ nodename=`$IBPATH/smpquery $ca_info nodedesc $lid | sed -e "s/^Node Description:
 
 text="`eval $IBPATH/perfquery $ca_info $lid $portnum`"
 rv=$?
+if echo $text | grep -q 'AllPortSelect not supported'; then
+   if [ "$verbose" = "yes" ]; then
+   echo -n "Error check on lid $lid ($nodename) port $portname: "
+   green "AllPortSelect not supported"
+   fi
+   exit 0
+fi
+
 if echo "$text" | awk -v mono=$bw -v brief=$brief -F '[.:]*' '
 function blue(s)
 {


___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] [PATCH] management: adding mad_dump_fields to libibmad

2010-05-06 Thread Mike Heinz

Sasha asked that I re-submit the patches for perfquery in a slightly different 
format. This is the second of 3 patches.

This patch uses the new mad_dump_fields function to suppress the display of 
extended attributes when querying switches that do not support them.

Signed off: Michael Heinz

-- snip --
diff --git a/infiniband-diags/src/perfquery.c b/infiniband-diags/src/perfquery.c
index 00ebfff..07a9226 100644
--- a/infiniband-diags/src/perfquery.c
+++ b/infiniband-diags/src/perfquery.c
@@ -302,7 +302,10 @@ static void dump_perfcounters(int extended, int timeout, uint16_t cap_mask,
if (aggregate)
aggregate_perfcounters();
else
-   mad_dump_perfcounters(buf, sizeof buf, pc, sizeof pc);
+   mad_dump_fields(buf, sizeof buf, pc, sizeof pc,
+   IB_PC_FIRST_F,
+   (cap_mask & 0x1000)?IB_PC_LAST_F:(IB_PC_RCV_PKTS_F+1));
+
} else {
if (!(cap_mask & 0x200))/* 1.2 errata: bit 9 is extended counter support */
IBWARN
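The range selection in this patch can be sketched on its own. In the sketch below the enum and the function name are hypothetical stand-ins for libibmad's `IB_PC_*_F` field indices; only the bit value `0x1000` and the "one past RCV_PKTS" fallback come from the patch. The end index is treated as exclusive, as in the patch's `IB_PC_RCV_PKTS_F+1`.

```c
#include <stdint.h>

/* Hypothetical field indices standing in for libibmad's IB_PC_*_F enum. */
enum pc_field_sketch {
    PC_FIRST_SKETCH,
    PC_XMT_PKTS_SKETCH,
    PC_RCV_PKTS_SKETCH,
    PC_TRAILING_SKETCH,   /* a field only some devices report */
    PC_LAST_SKETCH
};

/*
 * Pick the (exclusive) end of the field range to dump.
 * Devices that set capability bit 0x1000 get the full range;
 * all others stop right after the received-packets counter.
 */
static int last_field_to_dump(uint16_t cap_mask)
{
    return (cap_mask & 0x1000) ? PC_LAST_SKETCH : PC_RCV_PKTS_SKETCH + 1;
}
```

A device reporting a capability mask of 0x400, for example, would have its dump cut off after the received-packets counter.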
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] [PATCH] management: adding mad_dump_fields to libibmad

2010-05-06 Thread Mike Heinz

Sasha asked that I re-submit the patches for perfquery in a slightly different 
format. This is the first of 3 patches.

This patch adds a function to libibmad that allows the caller to dump a 
configurable range of MAD attributes. Basically, this provides an external 
interface to the internal function _dump_fields.

Signed Off: Michael Heinz

 snip --
diff --git a/libibmad/include/infiniband/mad.h b/libibmad/include/infiniband/mad.h
index 02ef551..0478c2b 100644
--- a/libibmad/include/infiniband/mad.h
+++ b/libibmad/include/infiniband/mad.h
@@ -1031,6 +1031,9 @@ MAD_EXPORT ib_mad_dump_fn
 mad_dump_perfcounters_xmt_disc, mad_dump_perfcounters_rcv_err,
 mad_dump_portsamples_control;
 
+MAD_EXPORT void mad_dump_fields(char *buf, int bufsz, void *val, int valsz,
+   int start, int end);
+
 MAD_EXPORT int ibdebug;
 
 #if __BYTE_ORDER == __LITTLE_ENDIAN
diff --git a/libibmad/src/dump.c b/libibmad/src/dump.c
index 335e190..cc9c10f 100644
--- a/libibmad/src/dump.c
+++ b/libibmad/src/dump.c
@@ -671,6 +671,11 @@ static int _dump_fields(char *buf, int bufsz, void *data, int start, int end)
return (int)(s - buf);
 }
 
+void mad_dump_fields(char *buf, int bufsz, void *val, int valsz, int start, int end)
+{
+   _dump_fields(buf, bufsz, val, start, end);
+}
+
+
 void mad_dump_nodedesc(char *buf, int bufsz, void *val, int valsz)
 {
strncpy(buf, val, bufsz);
diff --git a/libibmad/src/libibmad.map b/libibmad/src/libibmad.map
index e2d0b05..5778e3e 100644
--- a/libibmad/src/libibmad.map
+++ b/libibmad/src/libibmad.map
@@ -20,6 +20,7 @@ IBMAD_1.3 {
mad_dump_nodedesc;
mad_dump_nodeinfo;
mad_dump_opervls;
+   mad_dump_fields;
mad_dump_perfcounters;
mad_dump_perfcounters_ext;
mad_dump_perfcounters_xmt_sl;
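To make the idea of dumping a configurable range of fields concrete, here is a small self-contained sketch. It does not reproduce libibmad's `_dump_fields`; the field table, the `name:value` output format and every identifier in it are assumptions made for illustration. The range is half-open, `[start, end)`, which matches how the perfquery patch passes `IB_PC_RCV_PKTS_F+1` as the end.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field table; the real library keeps its own metadata. */
static const char *const sketch_field_names[] = {
    "field_a", "field_b", "field_c", "field_d", "field_e",
};
#define SKETCH_FIELD_COUNT 5

/*
 * Format values[start..end) into buf, one "name:value" line per field.
 * Returns the number of characters written (excluding the final NUL).
 * Lines that do not fit completely are dropped rather than truncated.
 */
static int dump_field_range(char *buf, int bufsz, const uint32_t *values,
                            int start, int end)
{
    int used = 0;

    if (bufsz <= 0)
        return 0;
    buf[0] = '\0';
    if (start < 0)
        start = 0;
    if (end > SKETCH_FIELD_COUNT)
        end = SKETCH_FIELD_COUNT;

    for (int i = start; i < end; i++) {
        int n = snprintf(buf + used, (size_t)(bufsz - used), "%s:%u\n",
                         sketch_field_names[i], (unsigned)values[i]);
        if (n < 0 || n >= bufsz - used) {
            buf[used] = '\0';  /* discard the partially written line */
            break;
        }
        used += n;
    }
    return used;
}
```

A caller that knows a device lacks the trailing fields simply passes a smaller `end`, which is the role `mad_dump_fields` plays for perfquery.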


___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] ibcheckerrors "Port All FAILED" reported

2010-05-06 Thread Mike Heinz

Yup - I've also sent a note to Sasha asking what happened to the patch.

-Original Message-
From: Ira Weiny [mailto:wei...@llnl.gov] 
Sent: Thursday, May 06, 2010 11:35 AM
To: Mike Heinz; Sasha Khapyorsky
Cc: Woodruff, Robert J; linux-r...@vger.kernel.org; EWG; tzipo...@mellanox.co.il
Subject: Re: [ewg] ibcheckerrors "Port All FAILED" reported

On Thu, 6 May 2010 06:26:55 -0700
Mike Heinz  wrote:

> Ira, 
> 
> I'm pretty sure I already fixed this problem. I submitted a patch to Sasha
> back in April.

The tests below are with the current master.

git://git.openfabrics.org/~sashak/management


Ira

> 
> 
> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org 
> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Ira Weiny
> Sent: Wednesday, May 05, 2010 9:10 PM
> To: Woodruff, Robert J; linux-r...@vger.kernel.org
> Cc: EWG; tzipo...@mellanox.co.il
> Subject: Re: [ewg] ibcheckerrors "Port All FAILED" reported
> 
> Interesting...
> 
> I have a switch which does this as well.  Tracing through the scripts shows
> that the perfquery command is failing like this.
> 
> 14:29:03 > ./perfquery 40 255
> ./perfquery: iberror: failed: AllPortSelect not supported
> 
> It seems there is an issue with the CapabilityMask value...
> 
> 14:43:32 > ./perfquery 40 255
> cap_mask 0x400  <=== my debug output
> ./perfquery: iberror: failed: AllPortSelect not supported
> 
> 14:43:38 > ./saquery CPI 40
> SA ClassPortInfo:
> ...
> Capability mask..0x2602
> ...
> 
> Those don't match because...  perfquery has a bug...
> 
> perfquery is issuing a PMA query when it should be issuing a SA query.  It
> just so happens that on some switches the result of that PMA query indicates
> AllPortSelect is available.  Patch to follow.
> 
> Ira
> 
> 
> On Wed, 5 May 2010 13:47:54 -0700
> "Woodruff, Robert J"  wrote:
> 
> > 
> > Hi guys,
> > 
> > When I run ibcheckerrors on my Mellanox switch,
> > it is reporting that Port all FAILED. 
> > 
> > From what I can tell, the switch is working fine and
> > I think that this is a bogus error from the program.
> > 
> > If this is indeed not a real problem, can the diagnostic
> > be fixed to not report this as an error ?
> > 
> > 
> > ibcheckerrors -nocolor -v -t 100
> > 
> > # Checking Switch: nodeguid 0x0002c902004046a0
> > Node check lid 7: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port all: 
> > FAILED   <
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 2: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 3: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 7: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 8: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 9: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 10: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 17: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 18: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 20: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 25: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 26: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 27: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 28: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 34: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 35: OK
> > Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 36: OK
> > 
> > # Checking Ca: nodeguid 0x0002c9030002628a
> > Node check lid 14: OK
> > Error check on lid 14 (cstnh-2 HCA-1) port 1: OK
> > 
> > # Checking Ca: nodeguid 0x0002c90300025e0a
> > Node check lid 12: OK
> > Error check on lid 12 (cstnh-3 HCA-1) port 1: OK
> > 
> > # Checking Ca: nodeguid 0x0002c9030002615e
> > Node check lid 15: OK
> > Error check on lid 15 (cstnh-4 HCA-1) port 1: OK
> > 
> > # Checking Ca: nodeguid 0x0002c9030008e442
> > Node check lid 11: OK
> > Error check on lid 11 (cstnh-8 HCA-1) port 1: OK
> > 
> > # Checking Ca: nodeguid 0x0002c9030008e44e
> > Node check lid 8: OK
> > Error check on lid 8 (cstnh-11 HCA-1) port 1: OK
> > 
> > # Checking Ca: nodeguid 0x0002c9030008e3e6
> > Node check lid 2: OK
> > Error check on lid 2 (cstnh-13 HCA-1) port 1: OK
> > 
> > #

Re: [ewg] ibcheckerrors "Port All FAILED" reported

2010-05-06 Thread Mike Heinz

Ira, 

I'm pretty sure I already fixed this problem. I submitted a patch to Sasha back 
in April.


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Ira Weiny
Sent: Wednesday, May 05, 2010 9:10 PM
To: Woodruff, Robert J; linux-r...@vger.kernel.org
Cc: EWG; tzipo...@mellanox.co.il
Subject: Re: [ewg] ibcheckerrors "Port All FAILED" reported

Interesting...

I have a switch which does this as well.  Tracing through the scripts shows
that the perfquery command is failing like this.

14:29:03 > ./perfquery 40 255
./perfquery: iberror: failed: AllPortSelect not supported

It seems there is an issue with the CapabilityMask value...

14:43:32 > ./perfquery 40 255
cap_mask 0x400  <=== my debug output
./perfquery: iberror: failed: AllPortSelect not supported

14:43:38 > ./saquery CPI 40
SA ClassPortInfo:
...
Capability mask..0x2602
...

Those don't match because...  perfquery has a bug...

perfquery is issuing a PMA query when it should be issuing a SA query.  It
just so happens that on some switches the result of that PMA query indicates
AllPortSelect is available.  Patch to follow.

Ira


On Wed, 5 May 2010 13:47:54 -0700
"Woodruff, Robert J"  wrote:

> 
> Hi guys,
> 
> When I run ibcheckerrors on my Mellanox switch,
> it is reporting that Port all FAILED. 
> 
> From what I can tell, the switch is working fine and
> I think that this is a bogus error from the program.
> 
> If this is indeed not a real problem, can the diagnostic
> be fixed to not report this as an error ?
> 
> 
> ibcheckerrors -nocolor -v -t 100
> 
> # Checking Switch: nodeguid 0x0002c902004046a0
> Node check lid 7: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port all: FAILED  
>  <
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 2: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 3: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 7: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 8: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 9: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 10: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 17: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 18: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 20: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 25: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 26: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 27: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 28: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 34: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 35: OK
> Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 36: OK
> 
> # Checking Ca: nodeguid 0x0002c9030002628a
> Node check lid 14: OK
> Error check on lid 14 (cstnh-2 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c90300025e0a
> Node check lid 12: OK
> Error check on lid 12 (cstnh-3 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030002615e
> Node check lid 15: OK
> Error check on lid 15 (cstnh-4 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e442
> Node check lid 11: OK
> Error check on lid 11 (cstnh-8 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e44e
> Node check lid 8: OK
> Error check on lid 8 (cstnh-11 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e3e6
> Node check lid 2: OK
> Error check on lid 2 (cstnh-13 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e44a
> Node check lid 18: OK
> Error check on lid 18 (cstnh-9 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c90300044fb4
> Node check lid 13: OK
> Error check on lid 13 (cstnh-7 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c90300044fbc
> Node check lid 10: OK
> Error check on lid 10 (cstnh-1 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e3ee
> Node check lid 9: OK
> Error check on lid 9 (cstnh-10 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e446
> Node check lid 4: OK
> Error check on lid 4 (cstnh-12 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e22e
> Node check lid 1: OK
> Error check on lid 1 (cstnh-14 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c9030008e43e
> Node check lid 19: OK
> Error check on lid 19 (cstnh-15 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0090270002000345
> Node check lid 6: OK
> Error check on lid 6 (cstnh-5 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0090270002000335
> Node check lid 5: OK
> Error check on lid 5 (cstnh-6 HCA-1) port 1: OK
> 
> # Checking Ca: nodeguid 0x0002c90300028238
> Node check lid 3: OK
> Error check on lid 3 (cst-linux HCA-1) port 1: OK
> 
> ## Summary: 17 nodes checked, 0 bad nodes found
> ##

Re: [ewg] ibcheckerrors "Port All FAILED" reported

2010-05-05 Thread Mike Heinz

Hi - the problem is that not all switches support the same features, and 
ibcheckerrors is treating this as an error. I believe this will be fixed in 
OFED 1.5.2.
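The decision a diagnostic has to make here can be sketched as a capability check before issuing an "all ports" query. Everything below is an illustrative assumption rather than the code in infiniband-diags: the helper name is hypothetical, port number 255 is taken from the `./perfquery 40 255` invocations quoted in this thread, and the AllPortSelect capability is assumed to be bit 8 of the performance-management capability mask.

```c
#include <stdint.h>

#define ALL_PORTS_SKETCH        255        /* "all ports" selector, as in "perfquery 40 255" */
#define CAP_ALL_PORT_SELECT_BIT (1u << 8)  /* assumption: AllPortSelect capability bit */

/*
 * Return 1 when an "all ports" request must be replaced by per-port
 * queries because the device does not advertise AllPortSelect.
 * A missing capability is a reason to fall back, not an error.
 */
static int needs_per_port_queries(uint16_t cap_mask, int portnum)
{
    return portnum == ALL_PORTS_SKETCH &&
           !(cap_mask & CAP_ALL_PORT_SELECT_BIT);
}
```

Under these assumptions, a switch reporting a capability mask of 0x400 would be walked port by port instead of being flagged as FAILED.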

-Original Message-
From: ewg-boun...@openfabrics.org [mailto:ewg-boun...@openfabrics.org] On 
Behalf Of Woodruff, Robert J
Sent: Wednesday, May 05, 2010 4:48 PM
To: EWG; tzipo...@mellanox.co.il
Subject: [ewg] ibcheckerrors "Port All FAILED" reported


Hi guys,

When I run ibcheckerrors on my Mellanox switch,
it is reporting that Port all FAILED. 

From what I can tell, the switch is working fine and
I think that this is a bogus error from the program.

If this is indeed not a real problem, can the diagnostic
be fixed to not report this as an error ?


ibcheckerrors -nocolor -v -t 100

# Checking Switch: nodeguid 0x0002c902004046a0
Node check lid 7: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port all: FAILED   
<
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 2: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 3: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 7: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 8: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 9: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 10: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 17: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 18: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 20: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 25: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 26: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 27: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 28: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 34: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 35: OK
Error check on lid 7 (Infiniscale-IV Mellanox Technologies) port 36: OK

# Checking Ca: nodeguid 0x0002c9030002628a
Node check lid 14: OK
Error check on lid 14 (cstnh-2 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c90300025e0a
Node check lid 12: OK
Error check on lid 12 (cstnh-3 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030002615e
Node check lid 15: OK
Error check on lid 15 (cstnh-4 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e442
Node check lid 11: OK
Error check on lid 11 (cstnh-8 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e44e
Node check lid 8: OK
Error check on lid 8 (cstnh-11 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e3e6
Node check lid 2: OK
Error check on lid 2 (cstnh-13 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e44a
Node check lid 18: OK
Error check on lid 18 (cstnh-9 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c90300044fb4
Node check lid 13: OK
Error check on lid 13 (cstnh-7 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c90300044fbc
Node check lid 10: OK
Error check on lid 10 (cstnh-1 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e3ee
Node check lid 9: OK
Error check on lid 9 (cstnh-10 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e446
Node check lid 4: OK
Error check on lid 4 (cstnh-12 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e22e
Node check lid 1: OK
Error check on lid 1 (cstnh-14 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c9030008e43e
Node check lid 19: OK
Error check on lid 19 (cstnh-15 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0090270002000345
Node check lid 6: OK
Error check on lid 6 (cstnh-5 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0090270002000335
Node check lid 5: OK
Error check on lid 5 (cstnh-6 HCA-1) port 1: OK

# Checking Ca: nodeguid 0x0002c90300028238
Node check lid 3: OK
Error check on lid 3 (cst-linux HCA-1) port 1: OK

## Summary: 17 nodes checked, 0 bad nodes found
##  32 ports checked, 0 ports have errors beyond threshold
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

Re: [ewg] Hang in ib_mad when unregistering.

2010-05-03 Thread Mike Heinz

Tziporet,

I have only seen this in OFED to date.  I've been unable to reproduce the 
problem under controlled circumstances, and since it occurs very rarely, we 
only see it after the fact, usually in our own test facility.

At this point, code inspection seems to be the only way forward - I've been 
staring at this since before New Years, thought I had found a hole where the 
problem could occur, but then the problem occurred again last week, which is 
why I'm reaching out to ewg. I'm hoping that someone who knows this module 
better than I could suggest a scenario that might be triggering the problem.

From: Tziporet Koren [mailto:tzipo...@dev.mellanox.co.il]
Sent: Sunday, May 02, 2010 4:05 PM
To: Mike Heinz
Cc: e...@openfabrics.org
Subject: Re: [ewg] Hang in ib_mad when unregistering.

On 4/30/2010 4:04 PM, Mike Heinz wrote:
Using OFED 1.5.0 and 1.5.1 we've been seeing nodes occasionally hang when a 
process tries to disconnect from the umad interface. Can anyone suggest what 
might be causing this?

Does this happen in OFED only, or also in the mainline kernel?
If it happens in the kernel too, I suggest you send this mail to the 
linux-rdma mailing list.

Tziporet
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] Hang in ib_mad when unregistering.

2010-04-30 Thread Mike Heinz

Using OFED 1.5.0 and 1.5.1 we've been seeing nodes occasionally hang when a 
process tries to disconnect from the umad interface. Can anyone suggest what 
might be causing this?

Here's a typical example:

Apr 29 10:01:37 st2139 kernel: qlgc_dsc  D 80148c54 0  5478  1  5497  5477 (NOTLB)
Apr 29 10:01:37 st2139 kernel:  81042b785dd8 0082 0062f388 437b2038
Apr 29 10:01:37 st2139 kernel:   000a 81043fa3f040 81043fb6e100
Apr 29 10:01:37 st2139 kernel:  3463ec0fbcd0 3720 81043fa3f228 00080062f388
Apr 29 10:01:37 st2139 kernel: Call Trace:
Apr 29 10:01:37 st2139 kernel:  [] do_futex+0x282/0xc3f
Apr 29 10:01:37 st2139 kernel:  [] wait_for_completion+0x79/0xa2
Apr 29 10:01:37 st2139 kernel:  [] default_wake_function+0x0/0xe
Apr 29 10:01:37 st2139 kernel:  []:ib_mad:ib_cancel_rmpp_recvs+0xa6/0xe9
Apr 29 10:01:37 st2139 kernel:  []:ib_mad:ib_unregister_mad_agent+0x30d/0x424
Apr 29 10:01:37 st2139 kernel:  []:ib_umad:ib_umad_unreg_agent+0x6f/0x94
Apr 29 10:01:37 st2139 kernel:  []:ib_umad:ib_umad_ioctl+0x4a/0x5d
Apr 29 10:01:37 st2139 kernel:  [] do_ioctl+0x21/0x6b
Apr 29 10:01:37 st2139 kernel:  [] vfs_ioctl+0x248/0x261
Apr 29 10:01:37 st2139 kernel:  [] sys_ioctl+0x59/0x78
Apr 29 10:01:37 st2139 kernel:  [] tracesys+0xd5/0xe0

Reviewing the code, the problem is that ib_cancel_rmpp_recvs is waiting for a 
completion that is never signaled: complete() is never called, presumably 
because the reference count on one of the rmpp structures is wrong:

static inline void deref_rmpp_recv(struct mad_rmpp_recv *rmpp_recv)
{
	if (atomic_dec_and_test(&rmpp_recv->refcount))
		complete(&rmpp_recv->comp);
}

static void destroy_rmpp_recv(struct mad_rmpp_recv *rmpp_recv)
{
	deref_rmpp_recv(rmpp_recv);
	wait_for_completion(&rmpp_recv->comp);
	ib_destroy_ah(rmpp_recv->ah);
	kfree(rmpp_recv);
}
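
To make the suspected failure mode concrete, here is a small userspace model of the pattern above. It is only a sketch, not kernel code: struct model_recv and all model_* names are hypothetical, a plain flag stands in for the kernel completion, and the assumption that the structure starts with one reference held by its owner is mine. It shows that teardown can only finish if every reference taken elsewhere has been dropped again.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct mad_rmpp_recv. */
struct model_recv {
	atomic_int refcount;
	bool completed;		/* stands in for the kernel completion */
};

static void model_init(struct model_recv *r)
{
	atomic_init(&r->refcount, 1);	/* assumed: owner holds one reference */
	r->completed = false;
}

static void model_get(struct model_recv *r)
{
	atomic_fetch_add(&r->refcount, 1);
}

/* Mirrors deref_rmpp_recv(): only the final put signals completion. */
static void model_put(struct model_recv *r)
{
	if (atomic_fetch_sub(&r->refcount, 1) == 1)
		r->completed = true;
}

/*
 * Mirrors destroy_rmpp_recv(), but reports whether the wait would
 * return instead of actually blocking. 'gets' references are taken
 * and 'puts' of them are released before teardown starts.
 */
static bool model_teardown_finishes(int gets, int puts)
{
	struct model_recv r;
	int i;

	model_init(&r);
	for (i = 0; i < gets; i++)
		model_get(&r);
	for (i = 0; i < puts; i++)
		model_put(&r);

	model_put(&r);		/* teardown drops the owner's reference */
	return r.completed;	/* false means the wait would hang */
}
```

With balanced calls, model_teardown_finishes(2, 2) reports that teardown completes. With one reference leaked, model_teardown_finishes(2, 1) reports that it does not, which corresponds to a process stuck in wait_for_completion() as in the trace above.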

Reviewing our internal bug database, I found that this problem has been 
around for several years, but we were never able to reproduce it under 
controlled circumstances. Most frequently, the problem occurred when 
trying to unload a module. Here's an example that was captured in 2007:


rmmod D 81003af6fd60 0 22020  21962
 81003b017c68 0082 813a22a8 81003b017c88
 81003b017c90 81003ab39800 81003fba6800 81003ab39a68
 00013b017c58 8126b945 0001 81042433
Call Trace:
 [] wait_for_completion+0xa0/0xb3
 [] flush_cpu_workqueue+0x29/0x6f
 [] default_wake_function+0x0/0xe
 [] wait_for_completion+0x8a/0xb3
 [] default_wake_function+0x0/0xe
 [] :ib_mad:ib_cancel_rmpp_recvs+0x8a/0xdf
 [] :ib_mad:ib_unregister_mad_agent+0x333/0x445
 [] :ib_sa:free_sm_ah+0x0/0x17
 [] :ib_mad:ib_agent_port_close+0x7c/0x8b
 [] :ib_mad:ib_mad_remove_device+0x38/0x85
 [] :ib_core:ib_unregister_device+0x30/0xc4
 [] :ib_ipath:ipath_unregister_ib_device+0x59/0x282
 [] :ib_ipath:ipath_remove_one+0x75/0x474
 [] pci_device_remove+0x24/0x48
 [] __device_release_driver+0x8e/0xb0
 [] driver_detach+0xce/0x10e
 [] bus_remove_driver+0x6d/0x90
 [] pci_unregister_driver+0x10/0x5f
 [] :ib_ipath:infinipath_cleanup+0x3f/0x4c
 [] sys_delete_module+0x196/0x1c5


Re: [ewg] [PATCH] Patch for libibmad

2010-04-21 Thread Mike Heinz

I agree that the stack dump is... weird, but it was reproducible, it happened 
every time they ran perfquery on their fabric. This patch (along with the other 
one) appeared to fix the problem.

> But such entries should never be used, at least not by perfquery.

The problem, I think, is the massive enumeration that's being used. Instead of 
assigning explicit values to all those constants, the code relies on the enums 
being listed in the correct order. I think that raises the risk that this 
problem could arise if the header does not match the version of the library 
at compile time (possibly because the user is recompiling).
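
As an illustration of that risk only, here is a sketch with made-up names. Neither enum is the real libibmad enumeration, and the name strings are placeholders. The "application" enum lacks one entry that the "library" enum has, so every later constant shifts by one and indexes the wrong table slot.

```c
/* Hypothetical field IDs as the library was built. */
enum lib_field {
	LIB_F_FIRST,
	LIB_F_PORT_SELECT,
	LIB_F_COUNTER_SELECT,
	LIB_F_XMT_DATA,
	LIB_F_LAST
};

/*
 * Hypothetical stale header the application was compiled against:
 * PORT_SELECT is missing, so all later values are one lower.
 */
enum app_field {
	APP_F_FIRST,
	APP_F_COUNTER_SELECT,
	APP_F_XMT_DATA,
	APP_F_LAST
};

/* Library-side table, indexed by the library's own enum values. */
static const char *const lib_names[LIB_F_LAST] = {
	[LIB_F_FIRST]          = "First",
	[LIB_F_PORT_SELECT]    = "PortSelect",
	[LIB_F_COUNTER_SELECT] = "CounterSelect",
	[LIB_F_XMT_DATA]       = "XmtData",
};

/* Range-checked lookup; the library cannot tell the ID is stale. */
static const char *lib_field_name(int id)
{
	if (id < 0 || id >= LIB_F_LAST)
		return "?";
	return lib_names[id];
}
```

Here lib_field_name(APP_F_XMT_DATA) yields "CounterSelect" rather than "XmtData": the ID is in range, so the bounds check passes, but it selects the wrong entry. Giving each constant an explicit value would keep existing IDs stable when entries are added.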

Anyway - I agree that we have a very poor understanding of the problem; if you 
want to hold off on this patch, that's fine. The other one is probably more 
useful.

-Original Message-
From: Sasha Khapyorsky [mailto:sashakv...@gmail.com] On Behalf Of Sasha 
Khapyorsky
Sent: Wednesday, April 21, 2010 6:09 AM
To: Mike Heinz
Cc: e...@openfabrics.org
Subject: Re: [PATCH] Patch for libibmad

Hi Mike,

On 12:16 Mon 19 Apr , Mike Heinz wrote:
> We had a customer report that perfquery was crashing on their nodes when 
> trying to query ports on a switch. When I examined the core dump, it was 
> clear that libibmad was dereferencing a null pointer from one of the mad_set_ 
> functions:
> 
> #0  0x in ?? ()
> #1  0x2ae4e13e7536 in mad_set_field () from /usr/lib64/libibmad.so.5
> #2  0x2ae4e13e7656 in mad_field_name () from /usr/lib64/libibmad.so.5
> #3  0x00401662 in mad_dump_perfcounters_rcv_sl ()
> #4  0x004024c9 in mad_dump_perfcounters_rcv_sl ()
> #5  0x2ae4e18168b4 in __libc_start_main () from /lib64/libc.so.6
> #6  0x00401189 in mad_dump_perfcounters_rcv_sl ()
> #7  0x7fffe5570ce8 in ?? ()
> #8  0x in ?? ()

I cannot find a code path in which mad_dump_perfcounters_rcv_sl() would end
up calling mad_set_field() (or even mad_field_name()). Can you?

> It appears that mad_set_field() was hitting a NULL pointer in the table of 
> MAD attributes (ib_mad_f). Such entries are being used to separate different 
> groups of mad attributes in the table.
>
> Reviewing the code, I noted that the mad_set_* and mad_get_* functions 
> already have some error checking to avoid going completely off the end of the 
> table, but they do not detect the case where the selected field is unset.

But such entries should never be used, at least not by perfquery. So it
is unclear to me how you are hitting this error.

> This patch corrects the problem.

I would like to understand the problem better before fixing something.

Sasha

[ewg] [PATCH] Adding new mad_dump_fields function to libibmad so that perfquery can be a bit more selective.

2010-04-19 Thread Mike Heinz

These patches are a modification to a patch I submitted earlier, based on 
Sasha's feedback. Rather than duplicating functionality between perfquery.c and 
libibmad/dump.c, this patch exposes the internal function _dump_fields() as a 
new API call, mad_dump_fields(). This lets perfquery choose which fields to 
print more selectively than mad_dump_perfcounters() permits.

Signed Off: Michael Heinz 




dump_fields.patch
Description: dump_fields.patch


perfquery.patch
Description: perfquery.patch

[ewg] [PATCH] Patch for libibmad

2010-04-19 Thread Mike Heinz

We had a customer report that perfquery was crashing on their nodes when trying 
to query ports on a switch. When I examined the core dump, it was clear that 
libibmad was dereferencing a null pointer from one of the mad_set_ functions:

#0  0x in ?? ()
#1  0x2ae4e13e7536 in mad_set_field () from /usr/lib64/libibmad.so.5
#2  0x2ae4e13e7656 in mad_field_name () from /usr/lib64/libibmad.so.5
#3  0x00401662 in mad_dump_perfcounters_rcv_sl ()
#4  0x004024c9 in mad_dump_perfcounters_rcv_sl ()
#5  0x2ae4e18168b4 in __libc_start_main () from /lib64/libc.so.6
#6  0x00401189 in mad_dump_perfcounters_rcv_sl ()
#7  0x7fffe5570ce8 in ?? ()
#8  0x in ?? ()

It appears that mad_set_field() was hitting a NULL pointer in the table of MAD 
attributes (ib_mad_f). Such entries are being used to separate different groups 
of mad attributes in the table.

Reviewing the code, I noted that the mad_set_* and mad_get_* functions already 
have some error checking to avoid going completely off the end of the table, 
but they do not detect the case where the selected field is unset. This patch 
corrects the problem.
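
The kind of check described above can be sketched roughly as follows. This is not the patch itself and not the libibmad source: struct field_desc, the table contents, and the function names are all hypothetical, and modelling an unset separator entry as zero length with a NULL name is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical field descriptor; not the real libibmad type. */
struct field_desc {
	int bitoffs;
	int bitlen;		/* 0 marks an unset separator entry */
	const char *name;	/* NULL for separator entries */
};

/* Hypothetical table with separators between groups of fields. */
static const struct field_desc table[] = {
	{ 0,  0,  NULL },		/* separator */
	{ 0,  8,  "PortSelect" },
	{ 8,  16, "CounterSelect" },
	{ 0,  0,  NULL },		/* separator between groups */
	{ 32, 32, "XmtData" },
};

#define TABLE_LEN (sizeof(table) / sizeof(table[0]))

/* Returns the descriptor, or NULL if the ID is out of range or unset. */
static const struct field_desc *lookup_field(int id)
{
	if (id < 0 || (size_t)id >= TABLE_LEN)
		return NULL;		/* existing style of check: off the end */
	if (table[id].bitlen == 0 || table[id].name == NULL)
		return NULL;		/* additional check: unset entry */
	return &table[id];
}

/* Callers get a placeholder string instead of dereferencing NULL. */
static const char *safe_field_name(int id)
{
	const struct field_desc *f = lookup_field(id);

	return f ? f->name : "(unset)";
}
```

In this model, an ID that lands on a separator (index 3) is rejected in the same way as an ID past the end of the table.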

Signed Off By: Michael Heinz 


fields.patch
Description: fields.patch

[ewg] OFED 1.5, RHEL4: IPv6 doesn't work between RHEL4 hosts and other distros.

2010-02-26 Thread Mike Heinz

One of my testers reported this; has anyone else seen it? According to the 
testers, IPv6 works correctly between two RHEL4 hosts, but it does not work 
between hosts running RHEL4u8 and hosts running SLES10 or RHEL5.

Given an RHEL5 host:

[r...@homer ~]# ifconfig ib0
ib0   Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
  inet addr:172.21.33.208  Bcast:172.21.33.255  Mask:255.255.255.0
  inet6 addr: fe80::206:6a00:a000:707f/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:17 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9 errors:0 dropped:24 overruns:0 carrier:0
  collisions:0 txqueuelen:256 

And an RHEL4 host:

[r...@apu ~]# ifconfig ib0
ib0   Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00  
  inet addr:172.21.33.210  Bcast:172.21.33.255  Mask:255.255.255.0
  inet6 addr: fe80::206:6a00:a000:6ca8/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:15 errors:0 dropped:0 overruns:0 frame:0
  TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:256 
  RX bytes:1092 (1.0 KiB)  TX bytes:2176 (2.1 KiB)

Pinging over IPv4 works:

[r...@homer ~]# ping 172.21.33.210
PING 172.21.33.210 (172.21.33.210) 56(84) bytes of data.
64 bytes from 172.21.33.210: icmp_seq=1 ttl=64 time=0.064 ms
64 bytes from 172.21.33.210: icmp_seq=2 ttl=64 time=0.035 ms
64 bytes from 172.21.33.210: icmp_seq=3 ttl=64 time=0.024 ms
64 bytes from 172.21.33.210: icmp_seq=4 ttl=64 time=0.026 ms

[r...@apu ~]# ping 172.21.33.208
PING 172.21.33.208 (172.21.33.208) 56(84) bytes of data.
64 bytes from 172.21.33.208: icmp_seq=0 ttl=64 time=0.053 ms
64 bytes from 172.21.33.208: icmp_seq=1 ttl=64 time=0.026 ms
64 bytes from 172.21.33.208: icmp_seq=2 ttl=64 time=0.026 ms
64 bytes from 172.21.33.208: icmp_seq=3 ttl=64 time=0.027 ms
64 bytes from 172.21.33.208: icmp_seq=4 ttl=64 time=0.025 ms

However, pinging over IPv6 fails:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6ca8
PING fe80::206:6a00:a000:6ca8(fe80::206:6a00:a000:6ca8) from fe80::206:6a00:a000:707f ib0: 56 data bytes
From fe80::206:6a00:a000:707f icmp_seq=1 Destination unreachable: Address unreachable
From fe80::206:6a00:a000:707f icmp_seq=2 Destination unreachable: Address unreachable
From fe80::206:6a00:a000:707f icmp_seq=3 Destination unreachable: Address unreachable

But pinging over IPv6 works between RHEL5 boxes:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:7d5e
PING fe80::206:6a00:a000:7d5e(fe80::206:6a00:a000:7d5e) from fe80::206:6a00:a000:707f ib0: 56 data bytes
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=0 ttl=64 time=1.72 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=3 ttl=64 time=0.044 ms

Similarly, the RHEL5 host can ping a SLES10 host over IPv6:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6cc1
PING fe80::206:6a00:a000:6cc1(fe80::206:6a00:a000:6cc1) from fe80::206:6a00:a000:707f ib0: 56 data bytes
64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=0 ttl=64 time=1.91 ms
64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=1 ttl=64 time=0.048 ms

Any ideas? Has anyone seen this? I tried turning on debugging in ipoib, but no 
additional information was output.
