I just discovered another interesting point. I tried to start opensm on one of 
my hosts and it went into STANDBY state. Here is the log of it trying to start 
up:

Mar 24 12:23:25 117170 [66DAC170] 0x80 -> OpenSM 3.3.5
Entering DISCOVERING state

Mar 24 12:23:25 117863 [66DAC170] 0x02 -> osm_vendor_init: 1000 pending umads specified
Mar 24 12:23:25 118022 [66DAC170] 0x80 -> Entering DISCOVERING state
Mar 24 12:23:25 120961 [66DAC170] 0x02 -> osm_vendor_bind: Binding to port 0x5ad00000bf1e1
Mar 24 12:23:25 129023 [66DAC170] 0x02 -> osm_vendor_bind: Binding to port 0x5ad00000bf1e1
Mar 24 12:23:25 129069 [66DAC170] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0005ad00000bf1e1
Mar 24 12:23:26 120384 [42E1E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0xf00001a51, Hop Ptr: 0x0
Mar 24 12:23:26 120444 [42E1E940] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
Mar 24 12:23:26 120461 [42E1E940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Using default GUID 0x5ad00000bf1e1
Entering STANDBY state

Mar 24 12:23:26 120538 [42C1D940] 0x80 -> Entering STANDBY state
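
For anyone scanning a longer opensm log for the same pattern, here is a minimal Python sketch that pulls out the state transitions and the ERR entries. The sample lines are copied from the log above with the mail wrapping removed. The function name `summarize` and the regular expressions are assumptions based on the line layout shown here; they are not part of OpenSM.

```python
import re

# Assumed layout, taken from the excerpt above:
#   "<date> <usec> [<thread id>] 0x<level> -> <message>"
LINE_RE = re.compile(r"\[[0-9A-F]+\] 0x[0-9a-f]+ -> (.*)$")
# Assumed error layout: "<function>: ERR <4-digit code>: <text>"
ERR_RE = re.compile(r"^(\w+): ERR ([0-9A-F]{4}): (.*)$")
STATE_RE = re.compile(r"^Entering (\w+) state$")

SAMPLE_LOG = """\
Mar 24 12:23:25 118022 [66DAC170] 0x80 -> Entering DISCOVERING state
Mar 24 12:23:26 120384 [42E1E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
Mar 24 12:23:26 120461 [42E1E940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Mar 24 12:23:26 120538 [42C1D940] 0x80 -> Entering STANDBY state
"""


def summarize(log_text):
    """Return the state transitions and error entries found in log_text."""
    states, errors = [], []
    for line in log_text.splitlines():
        entry = LINE_RE.search(line)
        if not entry:
            continue  # console echo or continuation line, not a log entry
        message = entry.group(1).strip()
        state = STATE_RE.match(message)
        if state:
            states.append(state.group(1))
        error = ERR_RE.match(message)
        if error:
            errors.append((error.group(1), error.group(2), error.group(3)))
    return {"states": states, "errors": errors}


if __name__ == "__main__":
    summary = summarize(SAMPLE_LOG)
    print("states:", " -> ".join(summary["states"]))
    for function, code, text in summary["errors"]:
        print(f"ERR {code} in {function}: {text}")
```

On the excerpt above this reports the DISCOVERING and STANDBY transitions and the two error codes, 5411 and 3113. Lines without the `[thread] 0xNN ->` prefix (the console echo and the indented continuation line) are skipped on purpose.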

Does that change the diagnosis at all? I'm still waiting for a response from 
t...@cisco.com

Thanks,
Mike
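
A note for anyone checking several nodes for the symptom quoted further down (link physically up, port state Initializing, base LID 0): the sketch below parses saved `ibstat` output and flags such ports. It is a rough Python sketch, not a diagnostic tool. The helper names `parse_ports` and `waiting_for_sm` are hypothetical, and the field names are taken only from the `ibstat` output in this thread.

```python
SAMPLE_IBSTAT = """\
CA 'mthca0'
    CA type: MT25204
    Number of ports: 1
    Firmware version: 1.2.917
    Hardware version: 20
    Node GUID: 0x0005ad00000c0990
    System image GUID: 0x0005ad000100d050
    Port 1:
            State: Initializing
            Physical state: LinkUp
            Rate: 20
            Base lid: 0
            LMC: 0
            SM lid: 0
            Capability mask: 0x02510a68
            Port GUID: 0x0005ad00000c0991
"""


def parse_ports(ibstat_text):
    """Return one dict per port, keyed by the field names ibstat prints."""
    ports = []
    ca_name = None
    current = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            ca_name = line.split("'")[1]
            current = None
        elif line.startswith("Port ") and line.endswith(":"):
            # Header such as "Port 1:" opens a new port section.
            current = {"ca": ca_name, "port": int(line[5:-1])}
            ports.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return ports


def waiting_for_sm(port):
    """True if the link is up but no LID has been assigned to the port."""
    return (
        port.get("Physical state") == "LinkUp"
        and port.get("State") == "Initializing"
        and port.get("Base lid") == "0"
    )


if __name__ == "__main__":
    for port in parse_ports(SAMPLE_IBSTAT):
        if waiting_for_sm(port):
            print(f"{port['ca']} port {port['port']}: link up, no LID assigned")
```

The check is deliberately narrow: it only matches the combination seen on the non-working nodes in this thread, so a port that is down for a cabling reason would not be flagged.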

On Mar 24, 2010, at 11:34 AM, Michael Robbert wrote:

> Interesting note! The 7024 is our large switch, where all the hosts are
> connected, but I was told that we were sold the 7000D because the 7024 didn't
> have a subnet manager. Unfortunately, the 7000D has a different CLI where that
> command is not available, and I don't have the password for our 7024, so I
> can't log on to it.
> On another note, I just noticed that the uptime on the 7000D is just over 1
> day, so that must have been the start of the problem, but I have no idea why
> it rebooted or why it didn't come up working. I'm pretty sure we tested a
> reboot of the device during acceptance testing.
> 
> Oh, I just got your second note:
> ==================================
> BTW, I highly recommend running the opensm on a server instead of using the 
> sm on the switch.  We found running the sm on the switch was much less 
> reliable.  I also recommend using a server dedicated to opensm only.
> ==================================
> 
> I will take that into consideration, but we bought this as a "turn-key"
> solution from Dell. They designed it, and we had no experience with IB, so we
> trusted their knowledge.
> 
> Thanks,
> Mike
> 
> 
> On Mar 24, 2010, at 11:12 AM, Meyer, Donald J wrote:
> 
>> http://www.cisco.com/en/US/docs/server_nw_virtual/7024/release_4.1/hardware/installation/guide/7024hig.pdf
>> 
>> smControl
>> Starts and stops the embedded subnet manager.
>> Syntax:
>> smControl start | stop | restart | status
>> 
>> Thanks,
>> Don Meyer
>> Senior Network/System Engineer/Programmer
>> US+ (253) 371-9532 iNet 8-371-9532
>> *Other names and brands may be claimed as the property of others
>> -----Original Message-----
>> From: linux-rdma-ow...@vger.kernel.org 
>> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Michael Robbert
>> Sent: Wednesday, March 24, 2010 10:00 AM
>> To: Ira Weiny
>> Cc: linux-rdma@vger.kernel.org
>> Subject: Re: ibstat stuck in state initialized after reboot
>> 
>> Ira,
>> Thanks for the quick response. That is what I was afraid of. I've been
>> looking through the switch documentation, but it doesn't cover starting,
>> stopping, or even checking the status of the SM service. I'll look into
>> opening a TAC case, but since Cisco has gotten out of the IB business, I'm
>> not looking forward to seeing what kind of product support they still have.
>> I can tell you a little more about our topology, since it is pretty simple.
>> All of our hosts are connected to the single large SFS switch, and the
>> 7000D, which is our subnet manager, is plugged only into that larger switch.
>> 
>> Thanks for the help and wish me luck with support!
>> 
>> Mike
>> 
>> On Mar 24, 2010, at 10:38 AM, Ira Weiny wrote:
>> 
>>> On Wed, 24 Mar 2010 10:26:02 -0600
>>> Michael Robbert <mrobb...@mines.edu> wrote:
>>> 
>>>> I hope this is the correct place to get help with the problem I have. I
>>>> have an IB fabric running on a Cisco SFS switch with a 7000D as the subnet
>>>> manager and the whole thing has been running great for well over a year
>>>> now, but today I noticed that after any node gets rebooted its IB link
>>>> doesn't initialize. This has happened on 4 hosts now. What I see is as
>>>> follows:
>>>> 
>>>> [r...@compute-2-7 ~]# ibstat
>>>> CA 'mthca0'
>>>>     CA type: MT25204
>>>>     Number of ports: 1
>>>>     Firmware version: 1.2.917
>>>>     Hardware version: 20
>>>>     Node GUID: 0x0005ad00000c0990
>>>>     System image GUID: 0x0005ad000100d050
>>>>     Port 1:
>>>>             State: Initializing
>>>>             Physical state: LinkUp
>>>>             Rate: 20
>>>>             Base lid: 0
>>>>             LMC: 0
>>>>             SM lid: 0
>>>>             Capability mask: 0x02510a68
>>>>             Port GUID: 0x0005ad00000c0991
>>>> 
>>>> I don't know much about subnet managers, since ours is in hardware and
>>>> we've never had to configure anything on it, but I can log in to the device
>>>> and it isn't showing any errors. On a node that hasn't been rebooted
>>>> recently and is still working, I can see what appears to be a working
>>>> subnet manager:
>>>> 
>>>> [r...@compute-2-10 ~]# sminfo 
>>>> sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 priority 10 state 3 SMINFO_MASTER
>>>> 
>>>> The same command on a non-working node shows this:
>>>> 
>>>> [r...@compute-2-7 ~]# sminfo 
>>>> sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY
>>>> 
>>>> So far I have reseated all the cables involved on both ends and I have
>>>> moved the cables on the switch end to new ports and none of that has made
>>>> a difference even after reboots. I am hoping to find a node that I can
>>>> take offline tomorrow so I can actually test the cables, but since this
>>>> seems to be happening to any host that reboots it doesn't appear to be a
>>>> cabling problem. Can anybody suggest where I should go from here? Is there
>>>> anything I can do from a working or non-working host to diagnose the
>>>> problem? Should I try rebooting the subnet manager switch? Will that
>>>> affect the rest of the fabric?
>>> 
>>> Have you spoken to Cisco about the problem?  You say you can log into the
>>> "device" (the SM switch?). If so, talk to Cisco about how you may be able
>>> to restart the SM there.
>>> 
>>> It does sound like the SM on the switch is failing to transition the links.
>>> If you can restart the SM on the switch, I would try that first.  Otherwise,
>>> yes, rebooting the switch is probably your best bet, and yes, it will affect
>>> the fabric, although I can't say how much without knowing the topology.
>>> 
>>> Ira
>>> 
>>>> 
>>>> Thanks,
>>>> Mike Robbert
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majord...@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>> 
>>> 
>>> -- 
>>> Ira Weiny
>>> Math Programmer/Computer Scientist
>>> Lawrence Livermore National Lab
>>> 925-423-8008
>>> wei...@llnl.gov
>> 
> 
