On 11:44 Mon 06 Mar, Jean-Christophe Hugly wrote:
>
> One more detail, I am running with LMC=2 because I wanted to check that
> the LMC>0 problems were fixed (they seem to be; I do not see any LMC-related
> misbehaviour.

Hmm, and I have some problems with LMC (even before the test, not
investigated yet)... Could you try without LMC?

Sasha.

> With -d1 everything looks shipshape).
>
> > Also I see that finally port becomes active but after delay. Those
> > delays look strange and inconsistent, I will need to test more tomorrow.
> > Could you try such modification for your script?
> >
> > i=1
> > while true; do
> >     modprobe -r ib_mthca
> >     sleep 3
> >     modprobe ib_mthca
> >     count=0
> >     while true ; do
> >         ibstat | egrep 'State: Active$' > /dev/null
> >         test $? -eq 0 && break
> >         count=`expr $count + 1`
> >         sleep 1
> >     done
> >     echo $i: delay $count
> >     sleep 3
> >     i=`expr $i + 1`
> > done
> >
> Here's the output from your script. After the last line it doesn't make
> further progress (I waited something like 10 minutes).
> Addressing Eitan's comment, I tried the same thing with a delay of 7
> seconds rather than 3 between modprobe -r and modprobe. The results are
> the same:
>
> 1: delay 0
> 2: delay 0
> 3: delay 0
> 4: delay 0
> 5: delay 0
> 6: delay 0
> <nothing happens>
>
> In case it contains useful clues, here's a sample of osm's log at
> around the point things start falling apart:
>
> Mar 06 11:31:36 036291 [40A04960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009
> TID:0x00000000000000c4
> Mar 06 11:31:36 036452 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x001393010b186ba0
> Mar 06 11:31:36 044333 [40A04960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008
> TID:0x00000000000000c8
> Mar 06 11:31:36 044921 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:1 num:128 from LID:0x0000
> GID:0xfe80000000000000,0x001393010b186b08
> Mar 06 11:31:36 056540 [40401960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 056562 [40401960] -> Discovered new port with
> GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 056570 [40401960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:64 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 056578 [40401960] -> Discovered new port with
> GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 056673 [40401960] -> osm_ucast_mgr_process: Min Hop Tables
> configured on all switches
> Mar 06 11:31:36 082257 [40A04960] -> osm_ucast_mgr_process: Min Hop Tables
> configured on all switches
> Mar 06 11:31:36 446369 [40602960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x04 num:144 Producer:1 from LID:0x0010
> TID:0x0000000000000000
> Mar 06 11:31:36 446400 [40401960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x04 num:144 Producer:1 from LID:0x0014
> TID:0x0000000000000001
> Mar 06 11:31:36 446614 [40602960] -> osm_report_notice: Reporting Generic
> Notice type:4 num:144 from LID:0x0010
> GID:0xfe80000000000000,0x001393000024a511
> Mar 06 11:31:36 446657 [40401960] -> osm_report_notice: Reporting Generic
> Notice type:4 num:144 from LID:0x0014
> GID:0xfe80000000000000,0x001393000024a512
> Mar 06 11:31:36 465919 [40401960] -> osm_ucast_mgr_process: Min Hop Tables
> configured on all switches
> Mar 06 11:31:36 473124 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473151 [40A04960] -> Removed port with
> GUID:0x001393000024a601 LID range [0x18,0x1B] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 473196 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473209 [40A04960] -> Removed port with
> GUID:0x001393000024a602 LID range [0x1C,0x1F] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 473526 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473568 [40A04960] -> Removed port with
> GUID:0x001393010b186b08 LID range [0x8,0x8] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:36 473710 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473722 [40A04960] -> Removed port with
> GUID:0x001393000024a511 LID range [0x10,0x13] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 473758 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 473770 [40A04960] -> Removed port with
> GUID:0x001393000024a512 LID range [0x14,0x17] of node:MT25218 InfiniHostEx
> Mellanox Technologies
> Mar 06 11:31:36 474015 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474050 [40A04960] -> Removed port with
> GUID:0x001393010b186ba0 LID range [0x9,0x9] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:36 474133 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474165 [40A04960] -> Removed port with
> GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:36 474238 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:36 474249 [40A04960] -> Removed port with
> GUID:0x0002c90200007afe LID range [0xC,0xF] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:36 474267 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2756
> Mar 06 11:31:36 474283 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2758
> Mar 06 11:31:36 474541 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR
> 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:36 474577 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2757
> Mar 06 11:31:36 474807 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275a
> Mar 06 11:31:36 474827 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2759
> Mar 06 11:31:36 474814 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275b
> Mar 06 11:31:36 474903 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275c
> Mar 06 11:31:36 474999 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275f
> Mar 06 11:31:36 475003 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275e
> Mar 06 11:31:36 475024 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2760
> Mar 06 11:31:36 475038 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x275d
> Mar 06 11:31:36 475089 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2761
> Mar 06 11:31:36 475140 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2762
> Mar 06 11:31:36 475158 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2763
> Mar 06 11:31:36 475173 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2764
> Mar 06 11:31:36 475231 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2765
> Mar 06 11:31:36 475248 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2766
> Mar 06 11:31:36 475295 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2767
> Mar 06 11:31:36 475332 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2768
> Mar 06 11:31:36 475367 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x276a
> Mar 06 11:31:36 475350 [40803960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x2769
> Mar 06 11:31:36 475432 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x276c
> Mar 06 11:31:36 475416 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x276b
> Mar 06 11:31:36 475492 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x276d
> Mar 06 11:31:36 475522 [40A04960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x276e
> Mar 06 11:31:36 475634 [40401960] -> __osm_state_mgr_signal_error: ERR 3303:
> Invalid signal OSM_SIGNAL_NO_PENDING_TRANSACTIONS(3) in state
> OSM_SM_STATE_IDLE
> Mar 06 11:31:38 040389 [40602960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:38 040409 [40602960] -> Removed port with
> GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:38 040419 [40602960] -> __osm_drop_mgr_remove_switch: ERR 0102:
> Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:38 040463 [40602960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:38 040474 [40602960] -> Removed port with
> GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:38 040486 [40803960] -> osm_si_rcv_process: ERR 3606: SwitchInfo
> received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:38 040587 [40602960] -> __osm_lid_mgr_process_our_sm_node: ERR
> 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:44 280928 [40401960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:128 Producer:2 from LID:0x0009
> TID:0x00000000000000c5
> Mar 06 11:31:44 280976 [40A04960] -> __osm_trap_rcv_process_request: Received
> Generic Notice type:0x01 num:128 Producer:2 from LID:0x0008
> TID:0x00000000000000c9
> Mar 06 11:31:44 282252 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 282266 [40A04960] -> Removed port with
> GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:44 282274 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102:
> Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:44 282304 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 282315 [40A04960] -> Removed port with
> GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:44 282327 [40602960] -> osm_si_rcv_process: ERR 3606: SwitchInfo
> received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:44 282441 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR
> 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:44 283808 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 283821 [40A04960] -> Removed port with
> GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:44 283829 [40A04960] -> __osm_drop_mgr_remove_switch: ERR 0102:
> Node 0x001393010b186ba0 not in switch table
> Mar 06 11:31:44 283859 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:44 283869 [40A04960] -> Removed port with
> GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:44 283882 [40401960] -> osm_si_rcv_process: ERR 3606: SwitchInfo
> received for nonexistent node with GUID = 0x1393010b186ba0
> Mar 06 11:31:44 283967 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR
> 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:48 047137 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:48 047201 [40A04960] -> Removed port with
> GUID:0x001393010b186ba0 LID range [0x0,0x0] of node:MT47396 Infiniscale-III
> Mellanox Technologies
> Mar 06 11:31:48 047290 [40A04960] -> osm_report_notice: Reporting Generic
> Notice type:3 num:65 from LID:0x0004 GID:0xfe80000000000000,0x0002c90200007afd
> Mar 06 11:31:48 047310 [40A04960] -> Removed port with
> GUID:0x0002c90200007afd LID range [0x4,0x7] of node:MT23108 InfiniHost
> Mellanox Technologies
> Mar 06 11:31:48 047451 [40A04960] -> __osm_lid_mgr_process_our_sm_node: ERR
> 0308: Can't acquire SM's Port object, GUID = 0x0002c90200007afd
> Mar 06 11:31:48 047537 [40602960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
> for parent node GUID = 0x1393010b186ba0, TID
> = 0x278d
> Mar 06 11:31:48 047543 [40401960] -> osm_pi_rcv_process: ERR 0F06: No Port
> object for port with GUID = 0x1393010b186ba0
>
> --
> Jean-Christophe Hugly <[EMAIL PROTECTED]>
> PANTA
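
The inner wait loop in the script quoted above never gives up, so when the
port stops reaching Active the test simply hangs, as it did after iteration 6.
Below is a minimal sketch of the same reload loop with a timeout on the wait,
so that a stuck iteration is reported and the loop stops. MAX_WAIT,
PORT_STATE_CMD and RUN_RELOAD_TEST are names made up for this sketch, and the
120 second default is an arbitrary assumption, not a value taken from this
thread.

```shell
#!/bin/sh
# Sketch only: reload loop from the quoted script, plus a timeout on the wait.
# MAX_WAIT, PORT_STATE_CMD and RUN_RELOAD_TEST are hypothetical names.

MAX_WAIT=${MAX_WAIT:-120}                 # seconds to wait per iteration (assumed)
PORT_STATE_CMD=${PORT_STATE_CMD:-ibstat}  # command that prints the port state

# Poll once per second until some port reports "State: Active".
# Prints the number of seconds waited; returns 1 if MAX_WAIT is reached.
wait_for_active() {
    count=0
    while [ "$count" -lt "$MAX_WAIT" ]; do
        if $PORT_STATE_CMD 2>/dev/null | grep -Eq 'State: Active$'; then
            echo "$count"
            return 0
        fi
        count=$((count + 1))
        sleep 1
    done
    echo "$count"
    return 1
}

# Unload and reload ib_mthca forever, stopping at the first iteration in
# which the port does not become Active within MAX_WAIT seconds.
reload_loop() {
    i=1
    while true; do
        modprobe -r ib_mthca
        sleep 3
        modprobe ib_mthca
        if delay=$(wait_for_active); then
            echo "$i: delay $delay"
        else
            echo "$i: port not Active after $delay seconds, giving up" >&2
            return 1
        fi
        sleep 3
        i=$((i + 1))
    done
}

# Reloading the driver is disruptive, so the loop only runs when asked to.
if [ "${RUN_RELOAD_TEST:-0}" = 1 ]; then
    reload_loop
fi
```

Run it as root with RUN_RELOAD_TEST=1 set in the environment. The iteration
number printed on the failure line marks the point at which to start reading
the osm log.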
