Re: Errors from ibchecknet
I checked another working fabric here and also see the same warnings, so it looks like the warnings are not really a problem.

Well, I assume that it is just IPoIB that isn't working. Since ibping works, I believe that says the IB part is ok. Of course, I can't run any of the perftools since they all need IPoIB to resolve the host IP.

Do you have any suggestions of what to check to diagnose the IPoIB problem? Specifically, can you think of any interaction with the normal networking stuff in the kernel that might be misconfigured? The reason I mention that is because I rebuilt/installed OFED (no errors/warnings) and it is in its default configuration, which is running well on other similar fabrics here. Therefore I assume the problem must be with the non-OFED stuff. Previously, whenever this kind of problem cropped up it has always been because opensm was not running. I did check that iptables was off, so it isn't a firewall issue.

- Chuck

On Thu, Sep 2, 2010 at 4:16 PM, Ira Weiny wei...@llnl.gov wrote:
On Thu, 2 Sep 2010 11:11:13 -0700 Chuck Hartley hartlc...@gmail.com wrote:

Sure, here is the output: Note this is with the switch we swapped in, so the port numbers don't match the ibchecknet output in the original message.

# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.6.0
    Hardware version: a0
    Node GUID: 0x0002c90300032de0
    System image GUID: 0x0002c90300032de3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 6
        LMC: 0
        SM lid: 6

Well the SM lid is set here. Is it set on the other nodes?

I don't run ibchecknet usually but I am getting the same errors here on a working fabric...

ibwarn: [13629] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 37 port 2
#warn: SM Lid is not configured
Port check lid 37 port 2: FAILED

Looking at this output I don't think this is an error.

13:17:14 smpquery nodeinfo 37
# Node info: Lid 37
BaseVers:1
ClassVers:...1
NodeType:Switch
NumPorts:24
...

On switch external Ports the Lid and SMLid are not used. Hal, would you concur?

Chuck, Is it just that IPoIB is not working for you?

Ira

        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300032de1
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300032de2
CA 'mthca0'
    CA type: MT25204
    Number of ports: 1
    Firmware version: 1.2.0
    Hardware version: a0
    Node GUID: 0x003048c64c0c
    System image GUID: 0x003048c64c0c0003
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510a68
        Port GUID: 0x003048c64c0c0001

# iblinkinfo
Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 5 1[ ] HCA-1 ( )
1 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 6 1[ ] linux70 HCA-1 ( )
1 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 7 1[ ] linux71 HCA-1 ( )
1 4[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 5[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 6[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 7[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 8[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 9[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 10[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 11[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 12[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 13[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 14[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 15[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 16[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 17[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 18[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 19[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 20[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 21[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 22[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 23[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 24
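To answer the "is the SM lid set on the other nodes?" question without reading ibstat by eye on every host, the text can be parsed. The following is only a rough Python sketch: it assumes the usual indented ibstat layout, the helper names `parse_ibstat` and `linkup_but_unmanaged` are made up for illustration, and the sample is abbreviated from the output quoted in this thread. The rule it applies (link physically up, but port not Active or SM lid still 0) is a heuristic assumption, not something ibstat itself reports.

```python
import re

# Abbreviated from the ibstat output quoted in this thread.
SAMPLE = """\
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 6
        LMC: 0
        SM lid: 6
        Port GUID: 0x0002c90300032de1
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Port GUID: 0x0002c90300032de2
"""


def parse_ibstat(text):
    """Return one dict per port, keyed by the field names ibstat prints."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            # New adapter section; fields until the next "Port N:" are CA-level.
            ca = m.group(1)
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ":" in line:
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


def linkup_but_unmanaged(ports):
    """Heuristic (assumption): cable is up but no SM has finished with the port."""
    return [
        p for p in ports
        if p.get("Physical state") == "LinkUp"
        and (p.get("State") != "Active" or p.get("SM lid") == "0")
    ]


if __name__ == "__main__":
    for p in parse_ibstat(SAMPLE):
        print(p["ca"], p["port"], p.get("State"), "SM lid", p.get("SM lid"))
```

Running the same function over ibstat output collected from each node would show quickly whether any host has a port that is LinkUp but never reached Active.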
Errors from ibchecknet
Hello,

We installed 1.5.1 and are having problems getting the IB fabric working. ibv_devinfo shows the HCA ports are ok and ibdiagnet reports no errors. However, ibchecknet shows that the switch ports are not being configured. We have never seen this before and are at a loss as to where the problem might be - would someone please point us in the right direction to look? Could it be a problem with the switch itself? Output from ibchecknet below.

# ibchecknet
Error check on lid 3 (Infiniscale-IV Mellanox Technologies) port all: FAILED
ibwarn: [26732] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 3 port 7
#warn: SM Lid is not configured
Port check lid 3 port 7: FAILED
# Checked Switch: nodeguid 0x0002c90200405368 with failure
ibwarn: [26751] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 3 port 10
#warn: SM Lid is not configured
Port check lid 3 port 10: FAILED
ibwarn: [26770] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 3 port 11
#warn: SM Lid is not configured
Port check lid 3 port 11: FAILED
ibwarn: [26789] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 3 port 34
#warn: SM Lid is not configured
Port check lid 3 port 34: FAILED
ibwarn: [26808] dump_perfcounters: PortXmitWait not indicated so ignore this counter
#warn: Lid is not configured lid 3 port 35
#warn: SM Lid is not configured
Port check lid 3 port 35: FAILED
# Checking Ca: nodeguid 0x0030487f3076
ibwarn: [26832] dump_perfcounters: PortXmitWait not indicated so ignore this counter
# Checking Ca: nodeguid 0x0030487f32b2
ibwarn: [26856] dump_perfcounters: PortXmitWait not indicated so ignore this counter
# Checking Ca: nodeguid 0x0002c9030003360c
# Checking Ca: nodeguid 0x0002c90300084162
ibwarn: [26904] dump_perfcounters: PortXmitWait not indicated so ignore this counter
# Checking Ca: nodeguid 0x0002c90300032de0
## Summary: 6 nodes checked, 0 bad nodes found
## 10 ports checked, 5 bad ports found
## 0 ports have errors beyond threshold

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
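For anyone comparing this kind of ibchecknet output between a broken and a working fabric, it may help to reduce it to the failed lid/port pairs plus the summary counts. Below is a small Python sketch that does that with two regular expressions. The function names `failed_ports` and `port_summary` are hypothetical, the patterns are only matched against the message formats visible in the output above, and the sample is trimmed from that output.

```python
import re
from collections import defaultdict

# Trimmed from the ibchecknet output in the message above.
SAMPLE = """\
Error check on lid 3 (Infiniscale-IV Mellanox Technologies) port all: FAILED
#warn: Lid is not configured lid 3 port 7
#warn: SM Lid is not configured
Port check lid 3 port 7: FAILED
#warn: Lid is not configured lid 3 port 10
#warn: SM Lid is not configured
Port check lid 3 port 10: FAILED
## Summary: 6 nodes checked, 0 bad nodes found
## 10 ports checked, 5 bad ports found
## 0 ports have errors beyond threshold
"""

# Patterns follow the lines shown in this thread; other versions may differ.
FAILED_PORT = re.compile(r"Port check lid (\d+) port (\d+):\s+FAILED")
PORT_SUMMARY = re.compile(r"## (\d+) ports checked, (\d+) bad ports found")


def failed_ports(text):
    """Map each lid to the list of its ports whose port check FAILED."""
    by_lid = defaultdict(list)
    for lid, port in FAILED_PORT.findall(text):
        by_lid[int(lid)].append(int(port))
    return dict(by_lid)


def port_summary(text):
    """Return (ports_checked, bad_ports) from the summary, or None if absent."""
    m = PORT_SUMMARY.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    print(failed_ports(SAMPLE))
    print(port_summary(SAMPLE))
```

Neither pattern depends on line breaks, so the same functions also work on output that has been flattened onto one line by a mail archive.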
Re: Errors from ibchecknet
Sure, here is the output: Note this is with the switch we swapped in, so the port numbers don't match the ibchecknet output in the original message.

# ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.6.0
    Hardware version: a0
    Node GUID: 0x0002c90300032de0
    System image GUID: 0x0002c90300032de3
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 6
        LMC: 0
        SM lid: 6
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300032de1
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300032de2
CA 'mthca0'
    CA type: MT25204
    Number of ports: 1
    Firmware version: 1.2.0
    Hardware version: a0
    Node GUID: 0x003048c64c0c
    System image GUID: 0x003048c64c0c0003
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510a68
        Port GUID: 0x003048c64c0c0001

# iblinkinfo
Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 5 1[ ] HCA-1 ( )
1 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 6 1[ ] linux70 HCA-1 ( )
1 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 7 1[ ] linux71 HCA-1 ( )
1 4[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 5[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 6[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 7[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 8[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 9[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 10[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 11[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 12[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 13[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 14[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 15[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 16[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 17[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 18[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 19[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 20[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 21[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 22[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 23[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 24[ ] ==( 4X 5.0 Gbps Active/ LinkUp)== 9 1[ ] HCA-1 ( )
1 25[ ] ==( 4X 5.0 Gbps Active/ LinkUp)== 8 1[ ] HCA-1 ( )
1 26[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 27[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 28[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 29[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 30[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 31[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 32[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 33[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 34[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 35[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 36[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )

On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny wei...@llnl.gov wrote:
On Thu, 2 Sep 2010 06:56:50 -0700 Chuck Hartley hartlc...@gmail.com wrote:

We swapped in a different switch and see the same errors. The opensm logfile does not show any errors:

Could you run ibstat on the node with OpenSM running? And iblinkinfo on the same node? Send that output.

Ira

-
OpenSM 3.3.5
Command Line Arguments:
Daemon mode
Log File: /var/log/opensm.log
-
OpenSM 3.3.5
Sep 02 05:56:29 933684 [B53B8700] 0x80 - OpenSM 3.3.5
Entering DISCOVERING state
Sep 02 05:56:29 934931 [B53B8700] 0x02 - osm_vendor_init: 1000 pending umads specified
Sep 02 05:56:29 935079 [B53B8700] 0x80 - Entering DISCOVERING state
Using default GUID 0x2c90300032de1
Entering MASTER state
Sep 02 05:56:29 953763 [B53B8700] 0x02 - osm_vendor_bind: Binding to port 0x2c90300032de1
Sep 02 05:56:29 990146 [B53B8700] 0x02
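With 36 switch ports and only a handful in use, it can be easier to pull just the Active links out of the iblinkinfo listing. Here is a hedged Python sketch that does so. The row pattern is an assumption written against the rows as they appear in this archive (quotes and column padding already stripped), so it may need adjusting for raw iblinkinfo output; `active_links` is a hypothetical helper name. It also multiplies link width by per-lane speed, which for the 4X 10.0 Gbps links gives the 40 that ibstat shows as Rate.

```python
import re

# A few rows copied from the iblinkinfo listing in this thread.
SAMPLE = """\
Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
1 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 5 1[ ] HCA-1 ( )
1 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)== 6 1[ ] linux70 HCA-1 ( )
1 4[ ] ==( 4X 2.5 Gbps Down/ Polling)== [ ] ( )
1 24[ ] ==( 4X 5.0 Gbps Active/ LinkUp)== 9 1[ ] HCA-1 ( )
"""

# Assumed row shape: lid, port, link description, then an optional peer.
ROW = re.compile(
    r"^\s*(?P<lid>\d+)\s+(?P<port>\d+)\[\s*\]\s*"
    r"==\(\s*(?P<width>\d+)X\s+(?P<speed>[\d.]+)\s+Gbps\s+"
    r"(?P<state>\w+)/\s*(?P<phys>\w+)\)==\s*"
    r"(?:(?P<peer_lid>\d+)\s+(?P<peer_port>\d+)\[\s*\]\s*"
    r"(?P<peer>.*?)\s*\(\s*\))?"
)


def active_links(text):
    """Return a dict for every switch port whose logical state is Active."""
    links = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m or m.group("state") != "Active":
            continue
        links.append({
            "port": int(m.group("port")),
            # Aggregate rate = number of lanes * per-lane speed.
            "rate_gbps": int(m.group("width")) * float(m.group("speed")),
            "peer_lid": int(m.group("peer_lid")) if m.group("peer_lid") else None,
            "peer": m.group("peer"),
        })
    return links


if __name__ == "__main__":
    for link in active_links(SAMPLE):
        print(link)
```

On the full listing this would leave five rows: three links at 4X 10.0 Gbps and two at 4X 5.0 Gbps.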
Re: Errors from ibchecknet
BTW, I am able to communicate between nodes via 'ibping'. That is the only test program I found that will work without needing a host IP. On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny wei...@llnl.gov wrote: On Thu, 2 Sep 2010 06:56:50 -0700 Chuck Hartley hartlc...@gmail.com wrote: We swapped in a different switch and see the same errors. The opensm logfile does not show any errors: Could you run ibstat on the node with OpenSM running? And iblinkinfo on the same node? Send that output. Ira
Re: ibstat stuck in state initialized after reboot
On Wed, Mar 24, 2010 at 2:25 PM, Ira Weiny wei...@llnl.gov wrote:
On Wed, 24 Mar 2010 11:34:02 -0600 Michael Robbert mrobb...@mines.edu wrote:

I will second this. OpenSM has come a long way since the time Cisco was selling IB switches. If I understand your situation you don't even need the 7000D you could just remove it and run OpenSM on a management node.

If you can afford it adding a node for OpenSM would be nice but I am not sure you _need_ it. OpenSM is now managing many of the largest IB networks out there, on a 288 node system it will have no problems at all out of the box.

Can you provide any guidelines to determine when a dedicated management node is beneficial? BTW, we also found that OpenSM is superior to the SM embedded in our switches.

Chuck