RE: [openib-general] question on opensm error
Hi,

There is a sys fail red light on the CPU on the 96-port switch that the opensm host attaches to. What's weird is none of the ib admin tools found anything. ibnetdiscover happily walked the whole subnet. The only problem was that opensm would not run, but the errors were unclear. So many things appeared to be working that it did not occur to me to walk over and look at the switch. Stupid of me.

Still not 100% clear on the failure mode. I don't know what the sys fail light on the CPU means. It may mean that things partially work. By that, I mean the CPU might crash but the IB chips continue to function based on their current setup. It would depend on the split of functionality between the CPU and the IB chip firmware (which may depend on vendor).

If you were able to walk the subnet with the (SMP based) diags, the SM port had to be at least in init (ibstat/ibstatus).

The keys are what was the failure mode so we can see how this can be detected better in the future, and what caused the switch CPU to crash in the first place.

-- Hal

I totally agree with Hal. The switch's CPU error is not the bug that concerns us. We should handle it just as a failure of a device, and we should be able to either overcome such a failure or at least be able to diagnose the error.

If you are able to reproduce the situation, please do it while the SM is running with the -V flag (full verbosity) and send the osm log file (/tmp/osm.log) to the list. This will help us understand what the opensm problem is. The output of ibnetdiscover may help too.

Thanks,
Shahar

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
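Hal's point that the SM port had to be at least in init can be checked mechanically from ibstat output. The following is a minimal sketch, assuming the indented `CA '...'` / `Port N:` / `State:` layout of the ibstat output quoted later in this thread; the function names and the trimmed sample are hypothetical and not part of any openib tool.

```python
import re

# Trimmed sample in the layout of the ibstat output quoted later in the thread.
SAMPLE = """\
CA 'mthca0':
    CA type: MT23108
    Number of ports: 2
    Port 1:
        State: Initializing
        Base lid: 0
        Port GUID: 0x0002c90108a03e61
    Port 2:
        State: Down
        Base lid: 0
        Port GUID: 0x0002c90108a03e62
"""


def port_fields(ibstat_text):
    """Return {(ca_name, port_number): {field: value}} from ibstat-style text."""
    ports = {}
    ca = None
    port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            # A new channel adapter section starts; no port selected yet.
            ca, port = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = int(m.group(1))
            ports[(ca, port)] = {}
            continue
        if port is not None and ": " in line:
            key, value = line.split(": ", 1)
            ports[(ca, port)][key] = value
    return ports


def ports_not_active(ports):
    """List the ports whose logical State is anything other than Active."""
    return sorted(key for key, fields in ports.items()
                  if fields.get("State") != "Active")


if __name__ == "__main__":
    for ca, port in ports_not_active(port_fields(SAMPLE)):
        print(ca, port)
```

A port that reports Initializing has a physical link but has not yet been brought to Active, which matches the situation described above: SMP-based diags can walk the subnet while the SM itself is not doing its job.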
Re: [openib-general] question on opensm error
On Wed, 2005-02-16 at 11:45, Ronald G. Minnich wrote:
On Tue, 16 Feb 2005, Hal Rosenstock wrote:
On Tue, 2005-02-15 at 22:22, Ronald G. Minnich wrote:
On Tue, 15 Feb 2005, Hal Rosenstock wrote:

I presume your subnet has 179 HCAs ? Do you know ?

no errors. It's just that opensm won't run.

Won't run or won't do anything on the subnet ? Not sure what you mean by won't run ?

ok, just found it. There is a sys fail red light on the CPU on the 96-port switch that the opensm host attaches to. What's weird is none of the ib admin tools found anything. ibnetdiscover happily walked the whole subnet. The only problem was that opensm would not run, but the errors were unclear. So many things appeared to be working that it did not occur to me to walk over and look at the switch. Stupid of me.

Still not 100% clear on the failure mode. I don't know what the sys fail light on the CPU means. It may mean that things partially work. By that, I mean the CPU might crash but the IB chips continue to function based on their current setup. It would depend on the split of functionality between the CPU and the IB chip firmware (which may depend on vendor).

If you were able to walk the subnet with the (SMP based) diags, the SM port had to be at least in init (ibstat/ibstatus).

The keys are what was the failure mode so we can see how this can be detected better in the future, and what caused the switch CPU to crash in the first place.

-- Hal

Now that I've turned that switch off I get this:

[1108572233:000155763][40BFF970] - __osm_state_mgr_sm_port_down_msg: ** ** SM PORT DOWN ** **
[1108572233:000155778][40BFF970] - __osm_sm_state_mgr_signal_error: ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state IB_SMINFO_STATE_DISCOVERING.

which I assume is its way of telling me that the switch port is down.

ron
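The osm.log lines quoted above are easier to correlate with events on the fabric once the bracketed prefix is split out. This is a minimal sketch, assuming the first bracket holds Unix epoch seconds followed by a sub-second counter and the second bracket holds a thread identifier; that reading is inferred from the samples in this thread rather than taken from the OpenSM sources, and `parse_osm_line` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

# [seconds:subsecond][thread] - function: message
LINE_RE = re.compile(r"\[(\d+):(\d+)\]\[([0-9A-Fa-f]+)\] - (\w+): (.*)")

SAMPLE_LINES = [
    "[1108572233:000155763][40BFF970] - __osm_state_mgr_sm_port_down_msg: "
    "** ** SM PORT DOWN ** **",
    "[1108572233:000155778][40BFF970] - __osm_sm_state_mgr_signal_error: "
    "ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state "
    "IB_SMINFO_STATE_DISCOVERING.",
]


def parse_osm_line(line):
    """Split one osm.log line into its fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    secs, subsec, thread, func, msg = m.groups()
    err = re.match(r"ERR (\w+):", msg)
    return {
        # Assumption: the first number is seconds since the Unix epoch.
        "time": datetime.fromtimestamp(int(secs), tz=timezone.utc),
        "subsec": int(subsec),
        "thread": thread,
        "function": func,
        "error_code": err.group(1) if err else None,
        "message": msg,
    }


if __name__ == "__main__":
    for line in SAMPLE_LINES:
        rec = parse_osm_line(line)
        print(rec["time"].isoformat(), rec["function"], rec["error_code"])
```

Pulling out the `ERR nnnn` code separately makes it straightforward to group a large log by error number before reading individual messages.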
Re: [openib-general] question on opensm error
Hi Ron,

On Mon, 2005-02-14 at 15:59, Ronald G. Minnich wrote:

formerly working opensm starts to get these:

So the OpenSM was up and running and these messages appeared in the log. Did anything change in the subnet ?

[1108414727:000284173][411FF970] - umad_receiver: send completed with error(method=1 attr=11) -- dropping.
[1108414727:000384171][411FF970] - umad_receiver: send completed with error(method=1 attr=11) -- dropping.
[1108414727:000484169][411FF970] - umad_receiver: send completed with error(method=1 attr=11) -- dropping.

These are failures of the OpenSM to send a SM Get(NodeInfo) which are used during the periodic subnet sweeps. I think the only way this error happens is if physical link is not present on the local link (e.g. logical link is not in init state or beyond).

So was a cable pulled somewhere ? Is this problem intermittent ? Does it come and go for no apparent reason ? Does the subnet get out of this state or do you need to restart OpenSM ?

Are there any other messages in the log around this which might be useful ?

Thanks.

-- Hal

what's a reasonable thing to look for, or should I just svn update and hope for the best?

thanks
ron
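Hal's reading of `method=1 attr=11` as Get(NodeInfo) follows from the management datagram method and SMP attribute IDs, where method 0x01 is Get and attribute 0x0011 is NodeInfo. Below is a minimal sketch that decodes and counts such lines, assuming the log prints both values in hexadecimal (which is what makes `attr=11` line up with NodeInfo); the tables list only a few common IDs and the helper names are hypothetical.

```python
import re
from collections import Counter

# A few management method codes and SMP attribute IDs (not exhaustive).
SMP_METHODS = {0x01: "Get", 0x02: "Set", 0x81: "GetResp"}
SMP_ATTRS = {
    0x0010: "NodeDescription",
    0x0011: "NodeInfo",
    0x0012: "SwitchInfo",
    0x0015: "PortInfo",
    0x0020: "SMInfo",
}

SEND_ERR_RE = re.compile(
    r"send completed with error\(method=([0-9A-Fa-f]+) attr=([0-9A-Fa-f]+)\)"
)


def decode_send_error(line):
    """Return e.g. 'Get(NodeInfo)' for a umad_receiver send error, else None."""
    m = SEND_ERR_RE.search(line)
    if m is None:
        return None
    # Assumption: both numbers are printed in hex without a 0x prefix.
    method = int(m.group(1), 16)
    attr = int(m.group(2), 16)
    method_name = SMP_METHODS.get(method, hex(method))
    attr_name = SMP_ATTRS.get(attr, hex(attr))
    return f"{method_name}({attr_name})"


def summarize(lines):
    """Count failed sends per decoded request type, streaming over the lines."""
    return Counter(d for d in map(decode_send_error, lines) if d is not None)


if __name__ == "__main__":
    sample = ("[1108414727:000284173][411FF970] - umad_receiver: send completed "
              "with error(method=1 attr=11) -- dropping.")
    print(decode_send_error(sample))
```

Because `summarize` accepts any iterable of lines, an open file object can be passed directly, so a large osm.log does not have to be read into memory at once.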
Re: [openib-general] question on opensm error
On Tue, 15 Feb 2005, Hal Rosenstock wrote:

ibstatus/ibstat can show the local port logical and physical port state.

bluesteel:~ # ibstat
CA 'mthca0':
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.3.2
        Hardware version: a1
        Node GUID: 0x0002c90108a03e60
        System image GUID: 0x0002c9000100d050
        Port 1:
                State: Initializing
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00500a68
                Port GUID: 0x0002c90108a03e61
        Port 2:
                State: Down
                Rate: 2
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00500a68
                Port GUID: 0x0002c90108a03e62

It might be helpful to try running ibnetdiscover -e (to show the errors). smpquery can also be used to query the bad link/host.

no -e switch on my copy. svn update time?

This was kind of interesting, it did find a lot of switches ...

[0][1][3][8][7][3][3][2][8][5][8] - known remote switch {0002c90108d19748} portnum 0 lid 0xe4-0xe4 MT43132 Mellanox Technologies
[0][1][3][8][7][3][3][2][8][2] - processing switch {0002c90108d19200} portnum 0 lid 0x0-0x0 MT43132 Mellanox Technologies

(more like this -- much more)

and some hcas

[0][1][3][8][7][3][3][2][8][2][2] - new remote hca {0002c901081e6700} portnum 1 lid 0x0-0x0 MT23108 InfiniHost Mellanox Technologies
[1] {0002c901081e6700}

but osm.log is about 59MB of these:

[1108475425:000915547][411FF970] - umad_receiver: send completed with error(method=1 attr=11) -- dropping.

smpquery? Have not seen that. Remember I'm trying to get this done with openib ONLY.
Probably a bad idea :-)

here's plain ibnetdiscover

bluesteel:~ # ibnetdiscover
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4] port 4 failed, skipping port
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2] port 2 failed, skipping port
warn: [4710] _do_madrpc: retry 2 (timeout 2000 ms)
warn: [4710] _do_madrpc: send failed; Invalid argument
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2] port 2 failed, skipping port

ron
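When ibnetdiscover skips several ports, comparing the failing paths can hint at where they converge. The sketch below is a rough aid, assuming the bracketed list in each `handle_port` warning is the hop-by-hop route from the local port to the node that did not answer; the helper names are hypothetical, and the regular expression is fitted only to the warnings quoted above.

```python
import re

FAIL_RE = re.compile(r"Nodeinfo on ((?:\[\d+\])+) port (\d+) failed")

SAMPLE = """\
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][5][3][2][8][2][4] port 4 failed, skipping port
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][4][1][1][2] port 2 failed, skipping port
warn: [4710] handle_port: Nodeinfo on [0][1][3][8][7][2][3][1][8][4][2] port 2 failed, skipping port
"""


def failed_routes(text):
    """Return [(hops, port)] for every 'Nodeinfo ... failed' warning in text."""
    routes = []
    for m in FAIL_RE.finditer(text):
        hops = [int(h) for h in re.findall(r"\[(\d+)\]", m.group(1))]
        routes.append((hops, int(m.group(2))))
    return routes


def shared_prefix(routes):
    """Longest run of leading hops common to all failing routes."""
    paths = [hops for hops, _ in routes]
    if not paths:
        return []
    prefix = []
    # zip stops at the shortest path, so the prefix never overruns any route.
    for column in zip(*paths):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


if __name__ == "__main__":
    routes = failed_routes(SAMPLE)
    print("failing routes:", len(routes))
    print("shared prefix:", shared_prefix(routes))
```

A long shared prefix only shows that the failing routes pass through the same first hops. With three warnings that is weak evidence, so it is best treated as a pointer for where to look (for instance with a walk to the switch) rather than as a diagnosis.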