ibstat stuck in state initialized after reboot

2010-03-24 Thread Michael Robbert
I hope this is the correct place to get help with the problem I have. I have an 
IB fabric running on a Cisco SFS switch with a 7000D as the subnet manager and 
the whole thing has been running great for well over a year now, but today I 
noticed that after any node gets rebooted its IB link doesn't initialize. This 
has happened on 4 hosts now. What I see is as follows:

[r...@compute-2-7 ~]# ibstat
CA 'mthca0'
   CA type: MT25204
   Number of ports: 1
   Firmware version: 1.2.917
   Hardware version: 20
   Node GUID: 0x0005ad0c0990
   System image GUID: 0x0005ad000100d050
   Port 1:
      State: Initializing
      Physical state: LinkUp
      Rate: 20
      Base lid: 0
      LMC: 0
      SM lid: 0
      Capability mask: 0x02510a68
      Port GUID: 0x0005ad0c0991
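
For anyone who wants to script this check across many nodes, here is a minimal sketch (not part of the original post). It only parses text in the format shown above and flags a port that is physically up but still has LID 0 and no SM LID. The names `parse_port_fields` and `waiting_for_sm` and the shortened sample are illustrative assumptions, and the parser assumes a single-port HCA.

```python
# Shortened copy of the ibstat output from the post, used as sample input.
SAMPLE = """\
CA 'mthca0'
   Number of ports: 1
   Port 1:
   State: Initializing
   Physical state: LinkUp
   Base lid: 0
   SM lid: 0
"""


def parse_port_fields(text):
    """Collect 'Key: value' pairs from ibstat-style text (single-port HCA only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Skip lines without a colon and headers such as 'Port 1:' with no value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def waiting_for_sm(fields):
    """True when the link is physically up but no SM has assigned a LID yet."""
    return (
        fields.get("Physical state") == "LinkUp"
        and fields.get("State") != "Active"
        and fields.get("Base lid") == "0"
        and fields.get("SM lid") == "0"
    )


if __name__ == "__main__":
    print(waiting_for_sm(parse_port_fields(SAMPLE)))
```

A port in this condition has a good cable and link partner, so the thing to look at next is the subnet manager rather than the physical layer.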

I don't know much about subnet managers, since ours is in hardware and we've 
never had to configure anything on it, but I can log in to the device and it 
isn't showing any errors. On a node that hasn't been rebooted recently and is 
still working, I can see what appears to be a working subnet manager:

[r...@compute-2-10 ~]# sminfo 
sminfo: sm lid 2 sm guid 0x5ad1df2a0, activity count 2146213408 priority 10 state 3 SMINFO_MASTER

The same command on a non-working node shows this:

[r...@compute-2-7 ~]# sminfo 
sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY
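
The difference between the two sminfo results can be checked mechanically. The following is a rough sketch (not from the original post) that parses a line in the format printed above and reports whether the node can see a master SM. The regular expression is an assumption based only on the two sample lines in this message, and the function names are hypothetical.

```python
import re

# Pattern inferred from the two sminfo lines quoted in this thread (an assumption).
SMINFO_RE = re.compile(
    r"sm lid (\d+) sm guid (0x[0-9a-fA-F]+), activity count (\d+) "
    r"priority (\d+) state (\d+) (\w+)"
)


def parse_sminfo(output):
    """Parse one sminfo result; tolerates the line being wrapped by a mail client."""
    text = " ".join(output.split())
    match = SMINFO_RE.search(text)
    if match is None:
        return None
    lid, guid, activity, priority, state, name = match.groups()
    return {
        "lid": int(lid),
        "guid": int(guid, 16),
        "activity": int(activity),
        "priority": int(priority),
        "state": int(state),
        "state_name": name,
    }


def sees_master(info):
    """True when the reply names a master SM with a non-zero LID."""
    return (
        info is not None
        and info["lid"] != 0
        and info["state_name"] == "SMINFO_MASTER"
    )


if __name__ == "__main__":
    good = ("sminfo: sm lid 2 sm guid 0x5ad1df2a0, activity count 2146213408 "
            "priority 10 state 3 SMINFO_MASTER")
    bad = ("sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 "
           "state 2 SMINFO_STANDBY")
    print(sees_master(parse_sminfo(good)), sees_master(parse_sminfo(bad)))
```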

So far I have reseated all the cables involved on both ends and I have moved 
the cables on the switch end to new ports and none of that has made a 
difference even after reboots. I am hoping to find a node that I can take 
offline tomorrow so I can actually test the cables, but since this seems to be 
happening to any host that reboots it doesn't appear to be a cabling problem. 
Can anybody suggest where I should go from here? Is there anything I can do 
from a working or non-working host to diagnose the problem? Should I try 
rebooting the subnet manager switch? Will that affect the rest of the fabric? 

Thanks,
Mike Robbert
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Ira Weiny
Have you spoken to Cisco about the problem?  You say you can log into the
device (the SM switch?); if so, talk to Cisco about how you may be able to
restart the SM there.

It does sound like the SM on the switch is failing to transition the links.
If you can restart the SM on the switch, I would try that first.  Otherwise,
yes, rebooting the switch is probably your best bet, and yes, it will affect
the fabric, although I can't say how much without knowing the topology.

Ira

 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov


Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Michael Robbert
Ira,
Thanks for the quick response. That is what I was afraid of. I've been looking 
through the switch documentation, but it doesn't cover starting, stopping, or 
even checking the status of the SM service. I'll look into opening a TAC case, 
but since Cisco has gotten out of the IB business I'm not looking forward to 
seeing what kind of product support they still have. I can tell you a little 
more about our topology since it is pretty simple. All of our hosts are 
connected to the single large SFS switch, and the 7000D, which is our subnet 
manager, is plugged only into that larger switch. 

Thanks for the help and wish me luck with support!

Mike



RE: ibstat stuck in state initialized after reboot

2010-03-24 Thread Meyer, Donald J
http://www.cisco.com/en/US/docs/server_nw_virtual/7024/release_4.1/hardware/installation/guide/7024hig.pdf

smControl
Starts and stops the embedded subnet manager.
Syntax:
smControl start | stop | restart | status

Thanks,
Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others


Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Michael Robbert
I just discovered another interesting point. I tried to start opensm on one of 
my hosts and it went into STANDBY state. Here is the log of it trying to start 
up:

Mar 24 12:23:25 117170 [66DAC170] 0x80 - OpenSM 3.3.5
Entering DISCOVERING state

Mar 24 12:23:25 117863 [66DAC170] 0x02 - osm_vendor_init: 1000 pending umads specified
Mar 24 12:23:25 118022 [66DAC170] 0x80 - Entering DISCOVERING state
Mar 24 12:23:25 120961 [66DAC170] 0x02 - osm_vendor_bind: Binding to port 0x5ad0bf1e1
Mar 24 12:23:25 129023 [66DAC170] 0x02 - osm_vendor_bind: Binding to port 0x5ad0bf1e1
Mar 24 12:23:25 129069 [66DAC170] 0x02 - osm_opensm_bind: Setting IS_SM on port 0x0005ad0bf1e1
Mar 24 12:23:26 120384 [42E1E940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
Method 0x1, Attr 0x11, TID 0xf1a51, Hop Ptr: 0x0
Mar 24 12:23:26 120444 [42E1E940] 0x01 - Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
Mar 24 12:23:26 120461 [42E1E940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Using default GUID 0x5ad0bf1e1
Entering STANDBY state

Mar 24 12:23:26 120538 [42C1D940] 0x80 - Entering STANDBY state
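
When comparing several start-up attempts, it can help to reduce a log like this to its state transitions and error codes. Below is a small sketch (not from the original post) that does so with two regular expressions. The patterns are assumptions drawn only from the lines quoted above, and `summarize_opensm_log` is a hypothetical helper name.

```python
import re

# A few lines taken from the log above, used as sample input.
SAMPLE_LOG = """\
Mar 24 12:23:25 118022 [66DAC170] 0x80 - Entering DISCOVERING state
Mar 24 12:23:26 120384 [42E1E940] 0x01 - umad_receiver: ERR 5411: DR SMP Send
completed with error -- dropping
Mar 24 12:23:26 120461 [42E1E940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113:
MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Entering STANDBY state
Mar 24 12:23:26 120538 [42C1D940] 0x80 - Entering STANDBY state
"""


def summarize_opensm_log(text):
    """Return (state transitions in order, error codes in order of appearance)."""
    states = []
    for name in re.findall(r"Entering (\w+) state", text):
        # The state is printed both to stdout and to the log; drop the repeats.
        if not states or states[-1] != name:
            states.append(name)
    errors = re.findall(r"ERR (\w+):", text)
    return states, errors


if __name__ == "__main__":
    print(summarize_opensm_log(SAMPLE_LOG))
```

For the excerpt above this yields DISCOVERING followed by STANDBY, with errors 5411 and 3113 in between.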

Does that change the diagnosis at all? I'm still waiting for a response from 
t...@cisco.com

Thanks,
Mike

On Mar 24, 2010, at 11:34 AM, Michael Robbert wrote:

 Interesting note! The 7024 is our large switch where all the hosts are 
 connected, but I was told that we were sold the 7000D because the 7024 didn't 
 have a subnet manager. Unfortunately the 7000D has a different CLI and that 
 command is not available and I don't have the password for our 7024 so I 
 can't log onto it. 
 On another note I just noticed the uptime on the 7000D is just over 1 day so 
 that must have been the start of the problem, but I have no idea why it 
 rebooted nor why it didn't come up working. I'm pretty sure we tested a 
 reboot of the device during acceptance testing.
 
 Oh, I just got your second note:
 ==
 BTW, I highly recommend running the opensm on a server instead of using the 
 sm on the switch.  We found running the sm on the switch was much less 
 reliable.  I also recommend using a server dedicated to opensm only.
 ==
 
 I will take that into consideration, but we bought this as a turn-key 
 solution from Dell. They designed it and we had no experience with IB so we 
 trusted their knowledge. 
 
 Thanks,
 Mike
 
 

Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Ira Weiny
On Wed, 24 Mar 2010 11:34:02 -0600
Michael Robbert mrobb...@mines.edu wrote:

 Interesting note! The 7024 is our large switch where all the hosts are
 connected, but I was told that we were sold the 7000D because the 7024
 didn't have a subnet manager. Unfortunately the 7000D has a different CLI
 and that command is not available and I don't have the password for our 7024
 so I can't log onto it. 

 On another note I just noticed the uptime on the 7000D is just over 1 day so
 that must have been the start of the problem, but I have no idea why it
 rebooted nor why it didn't come up working. I'm pretty sure we tested a
 reboot of the device during acceptance testing.
 
 Oh, I just got your second note:
 ==
 BTW, I highly recommend running the opensm on a server instead of using the
 sm on the switch.  We found running the sm on the switch was much less
 reliable.  I also recommend using a server dedicated to opensm only.
 ==

I will second this.  OpenSM has come a long way since the time Cisco was
selling IB switches.  If I understand your situation, you don't even need the
7000D; you could just remove it and run OpenSM on a management node.  If you
can afford it, adding a node for OpenSM would be nice, but I am not sure you
_need_ it.

OpenSM is now managing many of the largest IB networks out there; on a
288-node system it will have no problems at all out of the box.

:D

Ira
 
 I will take that into consideration, but we bought this as a turn-key
 solution from Dell. They designed it and we had no experience with IB so we
 trusted their knowledge. 

snip
 


Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Chuck Hartley
On Wed, Mar 24, 2010 at 2:25 PM, Ira Weiny wei...@llnl.gov wrote:
 On Wed, 24 Mar 2010 11:34:02 -0600
 Michael Robbert mrobb...@mines.edu wrote:

 I will second this.  OpenSM has come a long way since the time Cisco was
 selling IB switches.  If I understand your situation, you don't even need the
 7000D; you could just remove it and run OpenSM on a management node.  If you
 can afford it, adding a node for OpenSM would be nice, but I am not sure you
 _need_ it.

 OpenSM is now managing many of the largest IB networks out there; on a
 288-node system it will have no problems at all out of the box.


Can you provide any guidelines to determine when a dedicated
management node is beneficial?

BTW, we also found that OpenSM is superior to the SM embedded in
our switches.

Chuck


Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Michael Robbert
I've got good news. I was able to get opensm to take control. I gave it a 
priority of 15 and rebooted the 7000D. Unfortunately I'm not sure I can leave 
it like this forever. The only host I had with opensm installed is my test 
front end for an OS upgrade I'm testing. We're moving from Rocks 4.3 to Rocks 
5.3 (RHEL 4.5 to RHEL 5.4). I may need to reboot this node from time to time 
over the next couple of weeks, but at least I'm working right now.
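
For background on why priority matters here: as I understand the InfiniBand master election rules, the SM with the highest priority (a 4-bit value, so 15 is the maximum) becomes master, and a tie goes to the lowest port GUID. The sketch below (not part of the original post) models only that comparison. It ignores the handover handshake between running SMs, which is presumably why the 7000D still had to be rebooted. The priority and GUID values are the ones quoted earlier in this thread; `elect_master` is a hypothetical name.

```python
def elect_master(candidates):
    """Pick the master from (name, priority, guid) tuples.

    Highest priority wins; on equal priority the lowest GUID wins.
    This is a simplified model of the election, not a full SM state machine.
    """
    return min(candidates, key=lambda c: (-c[1], c[2]))


if __name__ == "__main__":
    fabric = [
        ("7000D embedded SM", 10, 0x5AD1DF2A0),  # priority from the sminfo output
        ("opensm on test host", 15, 0x5AD0BF1E1),  # priority set by hand
    ]
    print(elect_master(fabric)[0])
```

With the values above, the OpenSM instance is selected. If both SMs ran at priority 10, the host with GUID 0x5ad0bf1e1 would still win on the tie-break because its GUID is numerically lower.
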
So you say that a 288-node system will work out of the box. What happens when 
you hit 289? Is that a magic number or just an estimate? We have 268 compute 
nodes plus a few auxiliary nodes, so we're pretty close to that number. 

Thanks,
Mike



Re: ibstat stuck in state initialized after reboot

2010-03-24 Thread Ira Weiny
On Wed, 24 Mar 2010 13:42:55 -0600
Michael Robbert mrobb...@mines.edu wrote:

 I've got good news. I was able to get opensm to take control. I gave it a
 priority of 15 and rebooted the 7000D. Unfortunately I'm not sure I can
 leave it like this forever. The only host I had with opensm installed is my
 test front end for an OS upgrade I'm testing. We're moving from Rocks 4.3 to
 Rocks 5.3 (RHEL 4.5 to RHEL 5.4). I may need to reboot this node from time
 to time over the next couple of weeks, but at least I'm working right now.

 So you say that a 288-node system will work out of the box. What happens
 when you hit 289? Is that a magic number or just an estimate? We have 268
 compute nodes plus a few auxiliary nodes, so we're pretty close to that
 number. 

Nothing will happen when you hit 289.  I chose that number because a 7024 has
288 ports which I assumed was the size of your cluster.

There are those running large clusters (thousands of nodes) who have made some
changes to OpenSM for specialized topologies or better SA scalability.  In the
future those changes should be in OpenSM so as you grow, OpenSM grows with
you!

:-D

Ira

 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
wei...@llnl.gov