On Wed, 24 Mar 2010 10:26:02 -0600
Michael Robbert <mrobb...@mines.edu> wrote:

> I hope this is the correct place to get help with the problem I have. I have
> an IB fabric running on a Cisco SFS switch with a 7000D as the subnet
> manager and the whole thing has been running great for well over a year now,
> but today I noticed that after any node gets rebooted its IB link doesn't
> initialize. This has happened on 4 hosts now. What I see is as follows:
> 
> [r...@compute-2-7 ~]# ibstat
> CA 'mthca0'
>        CA type: MT25204
>        Number of ports: 1
>        Firmware version: 1.2.917
>        Hardware version: 20
>        Node GUID: 0x0005ad00000c0990
>        System image GUID: 0x0005ad000100d050
>        Port 1:
>                State: Initializing
>                Physical state: LinkUp
>                Rate: 20
>                Base lid: 0
>                LMC: 0
>                SM lid: 0
>                Capability mask: 0x02510a68
>                Port GUID: 0x0005ad00000c0991
> 
> I don't know much about subnet managers, since ours is in hardware and we've
> never had to configure anything on it, but I can log in to the device and it
> isn't showing any errors. On a node that hasn't been rebooted recently and
> is still working I can see what appears to be a working subnet manager:
> 
> [r...@compute-2-10 ~]# sminfo 
> sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 priority 10 state 3 SMINFO_MASTER
> 
> The same command on a non-working node shows this:
> 
> [r...@compute-2-7 ~]# sminfo 
> sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY
> 
> So far I have reseated all the cables involved on both ends and I have moved
> the cables on the switch end to new ports and none of that has made a
> difference even after reboots. I am hoping to find a node that I can take
> offline tomorrow so I can actually test the cables, but since this seems to
> be happening to any host that reboots it doesn't appear to be a cabling
> problem. Can anybody suggest where I should go from here? Is there anything
> I can do from a working or non-working host to diagnose the problem? Should
> I try rebooting the subnet manager switch? Will that affect the rest of the
> fabric? 

Have you spoken to Cisco about the problem?  You say you can log in to the
"device" (the SM switch?).  If so, ask Cisco how you can restart the SM
there.

It does sound like the SM on the switch is failing to transition the links
out of the Initializing state.  If you can restart the SM on the switch, I
would try that first.  Otherwise, yes, rebooting the switch is probably your
best bet, and yes, it will affect the fabric, although I can't say how much
without knowing the topology.
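
Before touching the switch, it may help to know how many nodes are in this
state.  The rough Python sketch below scans ibstat text output for ports that
are physically up but have not been configured by the SM (State
"Initializing", Base lid 0).  The function name unconfigured_ports and the
approach of parsing the text output are assumptions for illustration only,
not part of any OFED tool; the field names are the ones in the ibstat output
quoted above.

```python
import re


def unconfigured_ports(ibstat_text):
    """Return (ca_name, port_number) pairs that are LinkUp but still
    Initializing with Base lid 0, i.e. the SM has not configured them."""
    stuck = []
    ca = None
    port = None
    fields = {}

    def flush():
        # Evaluate the port collected so far, if there is one.
        if ca is not None and port is not None:
            if (fields.get("State") == "Initializing"
                    and fields.get("Physical state") == "LinkUp"
                    and fields.get("Base lid") == "0"):
                stuck.append((ca, port))

    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'$", line)
        if m:
            # New adapter section: close out the previous port first.
            flush()
            ca, port, fields = m.group(1), None, {}
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            # New port section within the current adapter.
            flush()
            port, fields = int(m.group(1)), {}
            continue
        if port is not None and ":" in line:
            # "Key: value" line belonging to the current port.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    flush()
    return stuck


if __name__ == "__main__":
    sample = """CA 'mthca0'
       CA type: MT25204
       Number of ports: 1
       Port 1:
               State: Initializing
               Physical state: LinkUp
               Rate: 20
               Base lid: 0
               LMC: 0
               SM lid: 0
"""
    print(unconfigured_ports(sample))
```

On a node, the input could come from
subprocess.run(["ibstat"], capture_output=True, text=True).stdout, collected
per host with whatever remote-execution tool you already use.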

Ira

> 
> Thanks,
> Mike Robbert
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov