Re: Errors from ibchecknet

2010-09-03 Thread Chuck Hartley
I checked another working fabric here and see the same warnings there,
so it looks like the warnings are not really a problem.

Well, I assume that it is just IPoIB that isn't working. Since ibping
works, I believe that says the IB part is ok. Of course, I can't run
any of the perftools since they all need IPoIB to resolve the host IP.

Do you have any suggestions of what to check to diagnose the IPoIB
problem?  Specifically, can you think of any interaction with the
normal networking stuff in the kernel that might be misconfigured?
The reason I mention that is because I rebuilt/installed OFED (no
errors/warnings) and it is in its default configuration, which is
running well on other similar fabrics here.  Therefore I assume the
problem must be with the non-OFED stuff. Previously, whenever this
kind of problem cropped up, it was always because opensm was not
running. I did check that iptables was off, so it isn't a firewall
issue.

- Chuck


On Thu, Sep 2, 2010 at 4:16 PM, Ira Weiny wei...@llnl.gov wrote:
 On Thu, 2 Sep 2010 11:11:13 -0700
 Chuck Hartley hartlc...@gmail.com wrote:

 Sure, here is the output:
 Note this is with the switch we swapped in, so the port numbers don't
 match the ibchecknet output in the original message.

 # ibstat
 CA 'mlx4_0'
       CA type: MT26428
       Number of ports: 2
       Firmware version: 2.6.0
       Hardware version: a0
       Node GUID: 0x0002c90300032de0
       System image GUID: 0x0002c90300032de3
       Port 1:
               State: Active
               Physical state: LinkUp
               Rate: 40
               Base lid: 6
               LMC: 0
               SM lid: 6

 Well the SM lid is set here.  Is it set on the other nodes?

 I don't run ibchecknet usually but I am getting the same errors here on a
 working fabric...

 ibwarn: [13629] dump_perfcounters: PortXmitWait not indicated so ignore this 
 counter
 #warn: Lid is not configured lid 37 port 2
 #warn: SM Lid is not configured
 Port check lid 37 port 2:  FAILED

 Looking at this output I don't think this is an error.

 13:17:14  smpquery nodeinfo 37
 # Node info: Lid 37
 BaseVers:1
 ClassVers:...1
 NodeType:Switch
 NumPorts:24
 ...

 On switch external Ports the Lid and SMLid are not used.

 Hal, would you concur?

 Chuck,
 Is it just that IPoIB is not working for you?

 Ira
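Ira's reasoning above (lid 37 is a switch, and external switch ports do not use Lid or SMLid) can be written down as a small check. This is only a sketch in Python: the function names are made up for illustration, the sample text is the smpquery excerpt quoted above, and treating port 0 as the only switch port that carries a LID is the assumption the check rests on.

```python
import re

# smpquery nodeinfo excerpt quoted above, used only as sample input.
SAMPLE_NODEINFO = """\
# Node info: Lid 37
BaseVers:1
NodeType:Switch
NumPorts:24
"""

# The dot padding smpquery prints between key and value is treated as optional.
NODE_TYPE = re.compile(r"^NodeType:\.*\s*(\S+)", re.MULTILINE)


def node_type(nodeinfo_text):
    """Return the first word of the NodeType value, or None if it is absent."""
    m = NODE_TYPE.search(nodeinfo_text)
    return m.group(1) if m else None


def lid_warning_expected(nodeinfo_text, port):
    """True when 'Lid is not configured' is expected for this port.

    Assumption of this sketch: on a switch only port 0 carries a LID,
    so the warning on any other switch port is not an error.
    """
    return node_type(nodeinfo_text) == "Switch" and port != 0


if __name__ == "__main__":
    print(lid_warning_expected(SAMPLE_NODEINFO, 2))
```

The same warning on a CA port would not be covered by this rule and would still deserve a look.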


               Capability mask: 0x0251086a
               Port GUID: 0x0002c90300032de1
       Port 2:
               State: Down
               Physical state: Polling
               Rate: 10
               Base lid: 0
               LMC: 0
               SM lid: 0
               Capability mask: 0x02510868
               Port GUID: 0x0002c90300032de2
 CA 'mthca0'
       CA type: MT25204
       Number of ports: 1
       Firmware version: 1.2.0
       Hardware version: a0
       Node GUID: 0x003048c64c0c
       System image GUID: 0x003048c64c0c0003
       Port 1:
               State: Down
               Physical state: Polling
               Rate: 10
               Base lid: 0
               LMC: 0
               SM lid: 0
               Capability mask: 0x02510a68
               Port GUID: 0x003048c64c0c0001

 # iblinkinfo
 Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
            1    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==       5
 1[  ]  HCA-1 ( )
            1    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==       6
 1[  ] linux70 HCA-1 ( )
            1    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==       7
 1[  ] linux71 HCA-1 ( )
            1    4[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1    5[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1    6[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1    7[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1    8[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1    9[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   10[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   11[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   12[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   13[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   14[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   15[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   16[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   17[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   18[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   19[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   20[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   21[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   22[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   23[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
 [  ]  ( )
            1   24

Errors from ibchecknet

2010-09-02 Thread Chuck Hartley
Hello,

We installed OFED 1.5.1 and are having problems getting the IB fabric
working. ibv_devinfo shows the HCA ports are ok and ibdiagnet reports
no errors. However, ibchecknet shows that the switch ports are not
being configured.  We have never seen this before and are at a loss as
to where the problem might be - would someone please point us in the
right direction to look?  Could it be a problem with the switch
itself? Output from ibchecknet below.


# ibchecknet
Error check on lid 3 (Infiniscale-IV Mellanox Technologies) port all:  FAILED
ibwarn: [26732] dump_perfcounters: PortXmitWait not indicated so
ignore this counter
#warn: Lid is not configured lid 3 port 7
#warn: SM Lid is not configured
Port check lid 3 port 7:  FAILED
# Checked Switch: nodeguid 0x0002c90200405368 with failure
ibwarn: [26751] dump_perfcounters: PortXmitWait not indicated so
ignore this counter
#warn: Lid is not configured lid 3 port 10
#warn: SM Lid is not configured
Port check lid 3 port 10:  FAILED
ibwarn: [26770] dump_perfcounters: PortXmitWait not indicated so
ignore this counter
#warn: Lid is not configured lid 3 port 11
#warn: SM Lid is not configured
Port check lid 3 port 11:  FAILED
ibwarn: [26789] dump_perfcounters: PortXmitWait not indicated so
ignore this counter
#warn: Lid is not configured lid 3 port 34
#warn: SM Lid is not configured
Port check lid 3 port 34:  FAILED
ibwarn: [26808] dump_perfcounters: PortXmitWait not indicated so
ignore this counter
#warn: Lid is not configured lid 3 port 35
#warn: SM Lid is not configured
Port check lid 3 port 35:  FAILED

# Checking Ca: nodeguid 0x0030487f3076
ibwarn: [26832] dump_perfcounters: PortXmitWait not indicated so
ignore this counter

# Checking Ca: nodeguid 0x0030487f32b2
ibwarn: [26856] dump_perfcounters: PortXmitWait not indicated so
ignore this counter

# Checking Ca: nodeguid 0x0002c9030003360c

# Checking Ca: nodeguid 0x0002c90300084162
ibwarn: [26904] dump_perfcounters: PortXmitWait not indicated so
ignore this counter

# Checking Ca: nodeguid 0x0002c90300032de0

## Summary: 6 nodes checked, 0 bad nodes found
##  10 ports checked, 5 bad ports found
##  0 ports have errors beyond threshold
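
The FAILED lines above can be tallied mechanically and compared with the summary. What follows is a minimal sketch in Python, assuming the ibchecknet text has been captured into a string; the name `tally_failed_ports` and the embedded sample (the port-check and summary lines from the output above) are illustrative and not part of any OFED tool.

```python
import re
from collections import defaultdict

# Port-check and summary lines taken from the ibchecknet output above.
SAMPLE = """\
Port check lid 3 port 7:  FAILED
Port check lid 3 port 10:  FAILED
Port check lid 3 port 11:  FAILED
Port check lid 3 port 34:  FAILED
Port check lid 3 port 35:  FAILED
## Summary: 6 nodes checked, 0 bad nodes found
##  10 ports checked, 5 bad ports found
##  0 ports have errors beyond threshold
"""

PORT_FAILED = re.compile(r"^Port check lid (\d+) port (\d+):\s+FAILED")
BAD_PORTS = re.compile(r"(\d+) ports checked, (\d+) bad ports found")


def tally_failed_ports(text):
    """Return ({lid: [failed ports]}, bad-port count from the summary or None)."""
    failed = defaultdict(list)
    reported = None
    for line in text.splitlines():
        m = PORT_FAILED.match(line)
        if m:
            failed[int(m.group(1))].append(int(m.group(2)))
            continue
        m = BAD_PORTS.search(line)
        if m:
            reported = int(m.group(2))
    return dict(failed), reported


if __name__ == "__main__":
    failed, reported = tally_failed_ports(SAMPLE)
    for lid, ports in sorted(failed.items()):
        print(f"lid {lid}: failed ports {ports}")
    print(f"summary reports {reported} bad ports")
```

Grouping by lid makes it easy to see that every failed port check in this run belongs to lid 3, the switch, and that none of the CAs failed.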
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Errors from ibchecknet

2010-09-02 Thread Chuck Hartley
Sure, here is the output:
Note this is with the switch we swapped in, so the port numbers don't
match the ibchecknet output in the original message.

# ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c90300032de0
System image GUID: 0x0002c90300032de3
Port 1:
State: Active
Physical state: LinkUp
Rate: 40
Base lid: 6
LMC: 0
SM lid: 6
Capability mask: 0x0251086a
Port GUID: 0x0002c90300032de1
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510868
Port GUID: 0x0002c90300032de2
CA 'mthca0'
CA type: MT25204
Number of ports: 1
Firmware version: 1.2.0
Hardware version: a0
Node GUID: 0x003048c64c0c
System image GUID: 0x003048c64c0c0003
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x003048c64c0c0001
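
To answer the "is the SM lid set?" question on several nodes, the ibstat text can be parsed instead of read by eye. The following is a rough Python sketch, assuming output shaped like the listing above; `parse_ibstat` and `active_ports_without_sm` are hypothetical helper names, and treating an SM lid of 0 as "no subnet manager recorded" is an assumption of the sketch.

```python
import re

# The mlx4_0 part of the ibstat output above, used only as sample input.
SAMPLE = """\
CA 'mlx4_0'
CA type: MT26428
Number of ports: 2
Port 1:
State: Active
Physical state: LinkUp
Base lid: 6
SM lid: 6
Port 2:
State: Down
Physical state: Polling
Base lid: 0
SM lid: 0
"""

CA_HEADER = re.compile(r"^CA '([^']+)'$")
PORT_HEADER = re.compile(r"^Port (\d+):$")


def parse_ibstat(text):
    """Return {ca_name: {port_number: {field: value}}} from ibstat-style text."""
    cas = {}
    ca = port = None
    for raw in text.splitlines():
        line = raw.strip()  # indentation varies between copies, so ignore it
        if not line:
            continue
        m = CA_HEADER.match(line)
        if m:
            ca, port = m.group(1), None
            cas[ca] = {}
            continue
        m = PORT_HEADER.match(line)
        if m and ca is not None:
            port = int(m.group(1))
            cas[ca][port] = {}
            continue
        # Only per-port fields are kept; CA-level fields are skipped.
        if port is not None and ":" in line:
            key, value = line.split(":", 1)
            cas[ca][port][key.strip()] = value.strip()
    return cas


def active_ports_without_sm(cas):
    """List (ca, port) pairs that are Active but report SM lid 0."""
    return [
        (ca, port)
        for ca, ports in cas.items()
        for port, fields in ports.items()
        if fields.get("State") == "Active" and fields.get("SM lid") == "0"
    ]


if __name__ == "__main__":
    print(active_ports_without_sm(parse_ibstat(SAMPLE)))
```

For the node shown here the list is empty, since the only Active port (mlx4_0 port 1) reports SM lid 6.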

# iblinkinfo
Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
   1    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==   5
1[  ]  HCA-1 ( )
   1    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==   6
1[  ] linux70 HCA-1 ( )
   1    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==   7
1[  ] linux71 HCA-1 ( )
   1    4[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1    5[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1    6[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1    7[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1    8[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1    9[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   10[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   11[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   12[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   13[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   14[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   15[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   16[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   17[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   18[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   19[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   20[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   21[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   22[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   23[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   24[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==   9
1[  ]  HCA-1 ( )
   1   25[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==   8
1[  ]  HCA-1 ( )
   1   26[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   27[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   28[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   29[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   30[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   31[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   32[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   33[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   34[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   35[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )
   1   36[  ] ==( 4X 2.5 Gbps   Down/ Polling)==
[  ]  ( )

On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny wei...@llnl.gov wrote:
 On Thu, 2 Sep 2010 06:56:50 -0700
 Chuck Hartley hartlc...@gmail.com wrote:

 We swapped in a different switch and see the same errors. The opensm
 logfile does not show any errors:

 Could you run ibstat on the node with OpenSM running?

 And iblinkinfo on the same node?

 Send that output.

 Ira


 -
 OpenSM 3.3.5
 Command Line Arguments:
  Daemon mode
  Log File: /var/log/opensm.log
 -
 OpenSM 3.3.5

 Sep 02 05:56:29 933684 [B53B8700] 0x80 - OpenSM 3.3.5
 Entering DISCOVERING state

 Sep 02 05:56:29 934931 [B53B8700] 0x02 - osm_vendor_init: 1000
 pending umads specified
 Sep 02 05:56:29 935079 [B53B8700] 0x80 - Entering DISCOVERING state
 Using default GUID 0x2c90300032de1
 Entering MASTER state

 Sep 02 05:56:29 953763 [B53B8700] 0x02 - osm_vendor_bind: Binding to
 port 0x2c90300032de1
 Sep 02 05:56:29 990146 [B53B8700] 0x02

Re: Errors from ibchecknet

2010-09-02 Thread Chuck Hartley
BTW, I am able to communicate between nodes via 'ibping'.  That is the
only test program I found that will work without needing a host IP.



On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny wei...@llnl.gov wrote:
 On Thu, 2 Sep 2010 06:56:50 -0700
 Chuck Hartley hartlc...@gmail.com wrote:

 We swapped in a different switch and see the same errors. The opensm
 logfile does not show any errors:

 Could you run ibstat on the node with OpenSM running?

 And iblinkinfo on the same node?

 Send that output.

 Ira



Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Chuck Hartley
On Wed, Mar 24, 2010 at 2:25 PM, Ira Weiny wei...@llnl.gov wrote:
 On Wed, 24 Mar 2010 11:34:02 -0600
 Michael Robbert mrobb...@mines.edu wrote:

 I will second this.  OpenSM has come a long way since the time Cisco was
 selling IB switches.  If I understand your situation, you don't even need the
 7000D; you could just remove it and run OpenSM on a management node.  If you
 can afford it, adding a node for OpenSM would be nice, but I am not sure you
 _need_ it.

 OpenSM is now managing many of the largest IB networks out there; on a 288
 node system it will have no problems at all out of the box.


Can you provide any guidelines to determine when a dedicated
management node is beneficial?

BTW, we also found that OpenSM is superior to the SM embedded in
our switches.

Chuck