Ira,

Thanks for the quick response. That is what I was afraid of. I've been looking through the switch documentation, but it doesn't cover starting, stopping, or even checking the status of the SM service. I'll look into opening a TAC case, but since Cisco has gotten out of the IB business, I'm not looking forward to seeing what kind of product support they still have.

I can tell you a little more about our topology, since it is pretty simple. All of our hosts are connected to the single large SFS switch, and the 7000D, which is our subnet manager, is plugged only into that larger switch.

Thanks for the help and wish me luck with support!

Mike

On Mar 24, 2010, at 10:38 AM, Ira Weiny wrote:

> On Wed, 24 Mar 2010 10:26:02 -0600
> Michael Robbert <mrobb...@mines.edu> wrote:
>
>> I hope this is the correct place to get help with the problem I have. I have
>> an IB fabric running on a Cisco SFS switch with a 7000D as the subnet
>> manager and the whole thing has been running great for well over a year now,
>> but today I noticed that after any node gets rebooted its IB link doesn't
>> initialize. This has happened on 4 hosts now. What I see is as follows:
>>
>> [r...@compute-2-7 ~]# ibstat
>> CA 'mthca0'
>>     CA type: MT25204
>>     Number of ports: 1
>>     Firmware version: 1.2.917
>>     Hardware version: 20
>>     Node GUID: 0x0005ad00000c0990
>>     System image GUID: 0x0005ad000100d050
>>     Port 1:
>>         State: Initializing
>>         Physical state: LinkUp
>>         Rate: 20
>>         Base lid: 0
>>         LMC: 0
>>         SM lid: 0
>>         Capability mask: 0x02510a68
>>         Port GUID: 0x0005ad00000c0991
>>
>> I don't know much about subnet managers, since ours is in hardware and we've
>> never had to configure anything on it, but I can login to the device and it
>> isn't showing any errors. On a node that hasn't been rebooted recently and
>> is still working I can see what appears to be a working subnet manager:
>>
>> [r...@compute-2-10 ~]# sminfo
>> sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 priority
>> 10 state 3 SMINFO_MASTER
>>
>> The same command on a non-working node shows this:
>>
>> [r...@compute-2-7 ~]# sminfo
>> sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2
>> SMINFO_STANDBY
>>
>> So far I have reseated all the cables involved on both ends and I have moved
>> the cables on the switch end to new ports and none of that has made a
>> difference even after reboots. I am hoping to find a node that I can take
>> offline tomorrow so I can actually test the cables, but since this seems to
>> be happening to any host that reboots it doesn't appear to be a cabling
>> problem. Can anybody suggest where I should go from here? Is there anything
>> I can do from a working or non-working host to diagnose the problem? Should
>> I try rebooting the subnet manager switch? Will that affect the rest of the
>> fabric?
>
> Have you spoken to Cisco about the problem? You say you can log into the
> "device" (the SM switch?) if so talk to Cisco about how you may be able to
> restart the SM there.
>
> It does sound like the SM on the switch is failing to transition the links.
> If you can restart the SM on the switch I would try that first. Otherwise yes
> rebooting the switch is probably your best bet, and yes it will affect the
> fabric, although I can't say how much without knowing the topology.
>
> Ira
>
>>
>> Thanks,
>> Mike Robbert
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
>
> --
> Ira Weiny
> Math Programmer/Computer Scientist
> Lawrence Livermore National Lab
> 925-423-8008
> wei...@llnl.gov
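
For anyone who wants to check many hosts for the symptom described in the original message, the per-node test can be scripted. The following is only a rough sketch in Python, assuming ibstat prints "Key: value" lines in the layout quoted in this thread. The function names parse_ibstat_ports and stuck_ports are made up for illustration and are not part of any IB tool. It flags ports whose physical link is up but which are still in the Initializing state with both Base lid and SM lid at 0, which is what a port looks like when no subnet manager has configured it yet.

```python
import re

# Sample text copied from the ibstat output quoted in this thread.
SAMPLE = """\
CA 'mthca0'
    CA type: MT25204
    Number of ports: 1
    Firmware version: 1.2.917
    Hardware version: 20
    Node GUID: 0x0005ad00000c0990
    System image GUID: 0x0005ad000100d050
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 20
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510a68
        Port GUID: 0x0005ad00000c0991
"""


def parse_ibstat_ports(text):
    """Return one dict per 'Port N:' block found in ibstat-style text.

    Hypothetical helper: it assumes the 'Key: value' layout shown above.
    """
    ports = []
    ca_name = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        ca_match = re.match(r"^CA '(.+)'$", line)
        if ca_match:
            # A new adapter starts; fields that follow belong to the CA
            # until the next 'Port N:' header appears.
            ca_name = ca_match.group(1)
            current = None
            continue
        port_match = re.match(r"^Port (\d+):$", line)
        if port_match:
            current = {"ca": ca_name, "port": int(port_match.group(1))}
            ports.append(current)
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports


def stuck_ports(ports):
    """Ports with a live physical link that were never given a LID."""
    return [
        p for p in ports
        if p.get("Physical state") == "LinkUp"
        and p.get("State") == "Initializing"
        and p.get("Base lid") == "0"
        and p.get("SM lid") == "0"
    ]


if __name__ == "__main__":
    for p in stuck_ports(parse_ibstat_ports(SAMPLE)):
        print(f"{p['ca']} port {p['port']}: link up but not configured by an SM")
```

On a real node the text would come from running ibstat rather than from the embedded sample. The check only confirms the symptom on the host side; it says nothing about why the SM on the switch is not transitioning the links.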