Hi Lan,

On 12:33 Thu 02 Aug, lbt wrote:
> Hi Sasha,
>
> I am hitting a problem where the user level MAD library seems to be timing
> out, causing the ports to be stuck in "INIT" state because the subnet has no
> "Master" SM available. The system is still in this state, so if there are
> any suggestions on what other type of debug info I could collect or clues to
> what the problem might be, it would be much appreciated :)
>
> I have 3 machines (OFED 1.1 stack, OpenSM v2.0.5), where 2 of them are
> running OpenSM, connected by an IB switch. Several tests were being done
> pulling IB cables, but not touching at all the IB connections between the
> Master SM and the IB switch, or rebooting the IB switch (i.e. no SM
> migration should be occurring). Everything was working fine, until at one
> point, I pulled the IB cable on the IB switch of the lower priority (standby)
> SM. For some reason, this starts causing problems on the higher priority
> Master SM. The higher priority SM now thinks it's in Standby state, and the
> lower priority SM's MAD packets are timing out. It is odd because I would
> not expect any effect on the higher priority SM (as its IB connections are
> not being affected). And I am not sure why MAD packets are timing out on the
> lower priority SM. Rebooting the lower priority SM and replugging IB cables
> into different ports on the IB switch didn't help.

Is this reproducible, or did it happen only once at random?

> Lower priority SM: (packets timeout)
> [EMAIL PROTECTED] ~]# sminfo -d -e -P 1
> ibwarn: [26764] smp_query: attr 21 mod 0 route DR path [0]
> ibwarn: [26764] mad_rpc: data offs 64 sz 64
> mad data
> 0000 0000 0000 0000 fe80 0000 0000 0000
> 0003 0001 0251 0a6a 0000 0000 0103 0302
> 1252 0011 4040 0008 0804 ff40 0000 0000
> 0000 2012 1088 0000 0000 0000 0000 0000
> ibwarn: [26764] smp_query: attr 32 mod 0 route Lid 1

It is possible that the Master SM dropped routing to the lid 1 node (which
was disconnected some time before the Master became Standby). I suppose
sminfo using a direct path should still work.

Sasha

> ibwarn: [26764] _do_madrpc: retry 1 (timeout 1000 ms)
> ibwarn: [26764] _do_madrpc: retry 2 (timeout 1000 ms)
> ibwarn: [26764] _do_madrpc: timeout after 3 retries, 3000 ms
> sminfo: iberror: [pid 26764] main: failed: query
>
> Higher priority SM: (thinks its Standby now)
> [EMAIL PROTECTED] log]# sminfo -d -e -P 1
> ibwarn: [2487] smp_query: attr 21 mod 0 route DR path [0]
> ibwarn: [2487] mad_rpc: data offs 64 sz 64
> mad data
> 0000 0000 0000 0000 fe80 0000 0000 0000
> 0002 0003 0251 0a6a 0000 0000 0103 0302
> 1252 0011 4040 0008 0804 ff40 0000 0000
> 0000 2012 1088 0000 0000 0000 0000 0000
> ibwarn: [2487] smp_query: attr 32 mod 0 route Lid 3
> ibwarn: [2487] mad_rpc: data offs 64 sz 64
> mad data
> 0050 4501 4a3a 0001 0000 0000 0000 0000
> 0000 020e 0200 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> sminfo: sm lid 3 sm guid 0x5045014a3a0001, activity count 526 priority 0
> state 2 SMINFO_STANDBY
>
> Just another data point, but each machine happens to have 2 HCA ports, port
> 1 and port 2. Port 1 is connected to a different subnet than port 2. During
> all these steps, the port 2 subnet is still fine and working OK. The problem
> described above was being seen with the port 1 subnet only.
>
> Thanks!
> Lan
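
For reference, the last "mad data" dump quoted above (the reply to attr 32,
SMInfo) can be checked by hand against the line sminfo prints. Below is a
rough Python sketch of that decoding. It assumes the SMInfo attribute layout
from the IB spec (64-bit GUID, 64-bit SM_Key, 32-bit ActCount, then one byte
holding a 4-bit priority and a 4-bit state). The name decode_sminfo and the
state labels are made up for illustration and are not part of any OFED tool.

```python
import struct

# SMState values per the IB spec numbering; the labels are illustrative only.
SM_STATES = {0: "NOTACTIVE", 1: "DISCOVERING", 2: "STANDBY", 3: "MASTER"}

# First two rows of the SMInfo reply quoted in the higher priority SM output.
DUMP = """
0050 4501 4a3a 0001 0000 0000 0000 0000
0000 020e 0200 0000 0000 0000 0000 0000
"""

def decode_sminfo(hex_dump):
    """Decode the leading SMInfo fields from a whitespace-separated hex dump."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    # Big-endian: GUID (8 bytes), SM_Key (8), ActCount (4), priority/state (1).
    guid, sm_key, act_count, pri_state = struct.unpack_from(">QQIB", raw, 0)
    state = pri_state & 0x0F
    return {
        "guid": guid,
        "sm_key": sm_key,
        "act_count": act_count,
        "priority": pri_state >> 4,      # high nibble
        "state": state,                  # low nibble
        "state_name": SM_STATES.get(state, "UNKNOWN"),
    }

if __name__ == "__main__":
    info = decode_sminfo(DUMP)
    print("sm guid %#x, activity count %d priority %d state %d %s" % (
        info["guid"], info["act_count"], info["priority"],
        info["state"], info["state_name"]))
```

The decoded values agree with the quoted sminfo line (guid 0x5045014a3a0001,
activity count 526, priority 0, state 2), so the SM answering at lid 3 really
is reporting Standby rather than sminfo misprinting it.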
