RE: [openib-general] question on opensm error

2005-02-17 Thread shaharf
Hi,

  There is a sys fail red light on the CPU on the 96-port switch that the
  opensm host attaches to.
 
  What's weird is none of the ib admin tools found anything. ibnetdiscover
  happily walked the whole subnet. The only problem was that opensm would
  not run, but the errors were unclear. So many things appeared to be
  working that it did not occur to me to walk over and look at the switch.
  Stupid of me.
 
 Still not 100% clear on the failure mode. I don't know what the sys fail
 light on the CPU means. It may mean that things partially work. By that,
 I mean the CPU might crash but the IB chips continue to function based
 on their current setup. It would depend on the split of functionality
 between the CPU and the IB chip firmware (which may depend on vendor).
 
 If you were able to walk the subnet with the (SMP based) diags, the SM
 port had to be at least in init (ibstat/ibstatus).
 
 The keys are what was the failure mode so we can see how this can be
 detected better in the future, and what caused the switch CPU to crash
 in the first place.
 
 -- Hal
 

I totally agree with Hal. The switch's CPU error is not the bug that
concerns us here. We should treat it simply as a device failure, and we
should be able to either overcome such a failure or at least diagnose
the error.
If you are able to reproduce the situation, please do so while the SM is
running with the -V flag (full verbosity) and send the osm log file
(/tmp/osm.log) to the list. This will help us understand what the
opensm problem is. The output of ibnetdiscover may help too.

Thanks,
Shahar

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] question on opensm error

2005-02-16 Thread Hal Rosenstock
On Wed, 2005-02-16 at 11:45, Ronald G. Minnich wrote:
 On Tue, 16 Feb 2005, Hal Rosenstock wrote:
 
  On Tue, 2005-02-15 at 22:22, Ronald G. Minnich wrote:
   On Tue, 15 Feb 2005, Hal Rosenstock wrote:
   
I presume your subnet has 179 HCAs ? Do you know ?
   
   no errors. It's just that opensm won't run. 
  
  Won't run or won't do anything on the subnet ?
  
  Not sure what you mean by won't run ?
 
 ok, just found it. 
 
 There is a sys fail red light on the CPU on the 96-port switch that the
 opensm host attaches to.
 
 What's weird is none of the ib admin tools found anything. ibnetdiscover 
 happily walked the whole subnet. The only problem was that opensm would 
 not run, but the errors were unclear. So many things appeared to be 
 working that it did not occur to me to walk over and look at the switch. 
 Stupid of me. 

Still not 100% clear on the failure mode. I don't know what the sys fail
light on the CPU means. It may mean that things partially work. By that,
I mean the CPU might crash but the IB chips continue to function based
on their current setup. It would depend on the split of functionality
between the CPU and the IB chip firmware (which may depend on vendor).

If you were able to walk the subnet with the (SMP based) diags, the SM
port had to be at least in init (ibstat/ibstatus).

The key questions are what the failure mode was, so we can see how this
can be detected better in the future, and what caused the switch CPU to
crash in the first place.

-- Hal

 Now that I've turned that switch off I get this:
 [1108572233:000155763][40BFF970] - __osm_state_mgr_sm_port_down_msg: 
 
 
 **
 ** SM PORT DOWN **
 **
 
 
 [1108572233:000155778][40BFF970] - __osm_sm_state_mgr_signal_error: ERR 
 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state 
 IB_SMINFO_STATE_DISCOVERING.
 
 which I assume is its way of telling me that the switch port is down. 
 
 ron



Re: [openib-general] question on opensm error

2005-02-15 Thread Hal Rosenstock
Hi Ron,

On Mon, 2005-02-14 at 15:59, Ronald G. Minnich wrote:
 formerly working opensm starts to get these:

So the OpenSM was up and running and these messages appeared in the log.
Did anything change in the subnet ?

 [1108414727:000284173][411FF970] - umad_receiver: send completed with 
 error(method=1 attr=11) -- dropping.
 [1108414727:000384171][411FF970] - umad_receiver: send completed with 
 error(method=1 attr=11) -- dropping.
 [1108414727:000484169][411FF970] - umad_receiver: send completed with 
 error(method=1 attr=11) -- dropping.

These are failures of OpenSM to send an SM Get(NodeInfo), which is
used during the periodic subnet sweeps. I think the only way this error
happens is if the physical link is not present on the local link (e.g.
the logical link is not in init state or beyond).
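
For readers following along, here is a minimal sketch of how the
"method=1 attr=11" pair in those log lines maps to Get(NodeInfo). The
method and attribute IDs are the ones defined by the InfiniBand
specification (only a small subset is listed). The function name
decode_send_error is made up for this illustration, and the sketch
assumes the log prints both numbers in hexadecimal without a 0x prefix,
which is consistent with the reading given above.

```python
import re

# Subset of management methods and SMP attribute IDs from the
# InfiniBand architecture specification.
MAD_METHODS = {0x01: "Get", 0x02: "Set", 0x81: "GetResp"}
SMP_ATTRIBUTES = {
    0x0010: "NodeDescription",
    0x0011: "NodeInfo",
    0x0012: "SwitchInfo",
    0x0015: "PortInfo",
    0x0020: "SMInfo",
}

# Assumption: both fields are hexadecimal digits without a 0x prefix.
FIELDS_RE = re.compile(r"method=([0-9a-fA-F]+)\s+attr=([0-9a-fA-F]+)")


def decode_send_error(line):
    """Return e.g. 'Get(NodeInfo)' for a umad_receiver error line, else None."""
    m = FIELDS_RE.search(line)
    if m is None:
        return None
    method = int(m.group(1), 16)
    attr = int(m.group(2), 16)
    # Fall back to the raw number for IDs not in the tables above.
    method_name = MAD_METHODS.get(method, "method 0x%02x" % method)
    attr_name = SMP_ATTRIBUTES.get(attr, "attribute 0x%04x" % attr)
    return "%s(%s)" % (method_name, attr_name)


sample = ("[1108414727:000284173][411FF970] - umad_receiver: send completed "
          "with error(method=1 attr=11) -- dropping.")
print(decode_send_error(sample))  # prints Get(NodeInfo)
```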

So was a cable pulled somewhere ? 

Is this problem intermittent ? Does it come and go for no apparent
reason ? Does the subnet get out of this state or do you need to 
restart OpenSM ?

Are there any other messages in the log around this which might be
useful ? 

Thanks.

-- Hal

 
 
 
 what's a reasonable thing to look for, or should I just svn update and 
 hope for the best?
 
 thanks
 
 ron


Re: [openib-general] question on opensm error

2005-02-15 Thread Ronald G. Minnich


On Tue, 15 Feb 2005, Hal Rosenstock wrote:

 ibstatus/ibstat can show the local port logical and physical port state.

bluesteel:~ # ibstat
CA 'mthca0':
    CA type: MT23108
    Number of ports: 2
    Firmware version: 3.3.2
    Hardware version: a1
    Node GUID: 0x0002c90108a03e60
    System image GUID: 0x0002c9000100d050
    Port 1:
        State: Initializing
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00500a68
        Port GUID: 0x0002c90108a03e61
    Port 2:
        State: Down
        Rate: 2
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00500a68
        Port GUID: 0x0002c90108a03e62
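
A rough sketch of checking this kind of output mechanically. It parses
ibstat-style text (shortened from the output above) and lists every port
that is not Active. The helper names parse_ibstat and ports_not_active
are made up for this example, and the layout is assumed to be exactly
what is pasted here. A port in Initializing with base lid 0 and SM lid 0
has a physical link but has not been configured by any SM yet, which
fits the picture of SMP-based diags working while opensm does not.

```python
import re


def parse_ibstat(text):
    """Return {ca_name: {port_number: {field: value}}} from ibstat-style text."""
    cas = {}
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"CA '([^']+)'", line)
        if m:
            ca = cas.setdefault(m.group(1), {})
            port = None
            continue
        # "Port 1:" starts a port section; "Port GUID: ..." does not match.
        m = re.match(r"Port (\d+):$", line)
        if m and ca is not None:
            port = ca.setdefault(int(m.group(1)), {})
            continue
        if port is not None and ":" in line:
            key, value = line.split(":", 1)
            port[key.strip()] = value.strip()
    return cas


def ports_not_active(cas):
    """List (ca, port, state) for every port whose state is not Active."""
    return [(name, num, fields.get("State"))
            for name, ports in cas.items()
            for num, fields in sorted(ports.items())
            if fields.get("State") != "Active"]


sample = """\
CA 'mthca0':
CA type: MT23108
Number of ports: 2
Port 1:
State: Initializing
Base lid: 0
SM lid: 0
Port GUID: 0x0002c90108a03e61
Port 2:
State: Down
Base lid: 0
SM lid: 0
Port GUID: 0x0002c90108a03e62
"""

cas = parse_ibstat(sample)
for ca_name, port_num, state in ports_not_active(cas):
    print("%s port %d: %s" % (ca_name, port_num, state))
```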


 It might be helpful to try running ibnetdiscover -e (to show the
 errors). smpquery can also be used to query the bad link/host.

no -e switch on my copy. svn update time? 

This was kind of interesting, it did find a lot of switches ...
[0][1][3][8][7][3][3][2][8][5][8] - known remote switch 
{0002c90108d19748} portnum 0 lid 0xe4-0xe4 MT43132 Mellanox Technologies
[0][1][3][8][7][3][3][2][8][2] - processing switch {0002c90108d19200} 
portnum 0 lid 0x0-0x0 MT43132 Mellanox Technologies

(more like this -- much more)

and some hcas
[0][1][3][8][7][3][3][2][8][2][2] - new remote hca {0002c901081e6700} 
portnum 1 lid 0x0-0x0 MT23108 InfiniHost Mellanox Technologies
[1] {0002c901081e6700}

but osm.log is about 59MB of these:
[1108475425:000915547][411FF970] - umad_receiver: send completed with 
error(method=1 attr=11) -- dropping.
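
With a log that size, a summary is easier to post than the file itself.
The following is only a sketch: it assumes each entry starts with
"[<epoch seconds>:<sub-second field>][<thread id>] - ", as in the lines
quoted in this thread, and it collapses identical messages into a count
plus first and last timestamps (UTC). The function name summarize is
invented for the example, and it reads the whole text into memory, so a
real 59MB file would be better handled line by line.

```python
import re
from collections import OrderedDict
from datetime import datetime, timezone

# Assumed entry prefix, taken from the samples in this thread:
# [<epoch seconds>:<sub-second field>][<thread id>] - <message>
ENTRY_RE = re.compile(r"\[(\d+):(\d+)\]\[([0-9A-Fa-f]+)\]\s+-\s+")


def summarize(log_text):
    """Return {message: [count, first_seen, last_seen]} with UTC datetimes."""
    summary = OrderedDict()
    matches = list(ENTRY_RE.finditer(log_text))
    for i, m in enumerate(matches):
        # The message runs until the next entry prefix (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        # Collapse mail-wrapped continuation lines into a single line.
        message = " ".join(log_text[m.end():end].split())
        when = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
        entry = summary.setdefault(message, [0, when, when])
        entry[0] += 1
        entry[1] = min(entry[1], when)
        entry[2] = max(entry[2], when)
    return summary


sample = """\
[1108414727:000284173][411FF970] - umad_receiver: send completed with
error(method=1 attr=11) -- dropping.
[1108414727:000384171][411FF970] - umad_receiver: send completed with
error(method=1 attr=11) -- dropping.
[1108475425:000915547][411FF970] - umad_receiver: send completed with
error(method=1 attr=11) -- dropping.
"""

result = summarize(sample)
for message, (count, first, last) in result.items():
    print("%6d  %s .. %s  %s" % (count, first.isoformat(), last.isoformat(), message))
```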

smpquery? Have not seen that. Remember I'm trying to get this done with 
openib ONLY. Probably a bad idea :-)



here's plain ibnetdiscover

bluesteel:~ # ibnetdiscover 
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4] 
port 4 failed, skipping port
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2] 
port 2 failed, skipping port
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2] 
port 2 failed, skipping port
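
One thing that can be pulled out of warnings like these is how far the
failing paths agree. The sketch below assumes the bracketed numbers are
the port path that ibnetdiscover followed from the local port, extracts
each failed path, and computes their common prefix. The names
failed_routes and common_prefix are made up for this example. A shared
prefix is only a hint about where to look, since most paths in the
subnet may go through the same first hops anyway.

```python
import re

# Matches "Nodeinfo on [0][1]... port N failed", tolerating mail line wraps.
FAIL_RE = re.compile(r"Nodeinfo on\s+((?:\[\d+\])+)\s+port\s+(\d+)\s+failed")


def failed_routes(text):
    """Return [(list_of_hops, port)] for every failed NodeInfo query."""
    routes = []
    for m in FAIL_RE.finditer(text):
        hops = [int(h) for h in re.findall(r"\[(\d+)\]", m.group(1))]
        routes.append((hops, int(m.group(2))))
    return routes


def common_prefix(paths):
    """Return the longest list of leading hops shared by all paths."""
    prefix = []
    for column in zip(*paths):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


sample = """\
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4]
port 4 failed, skipping port
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2]
port 2 failed, skipping port
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2]
port 2 failed, skipping port
"""

routes = failed_routes(sample)
shared = common_prefix([hops for hops, _port in routes])
print("failed queries:", len(routes))
print("shared leading hops:", shared)  # [0, 1, 3, 8, 7] for this sample
```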

ron