Hi,

On 12/13/2011 2:35 PM, Hector Abrach wrote:
> Hello,
>
> I have a boot problem with OpenSM

Are you saying the switch is booted rather than OpenSM? What is OpenSM
running on, and in what environment?

> the problem occurs rarely and
> started to occur when we started using a new Mellanox MT1118X03342 switch.
> The problem occurs during the discovery phase within state_mgr_sweep_hop_1.
>
> However, I discovered that the actual cause is that
> qp0_mads_outstanding occasionally stalls at 1.

Is it stuck, or does it get updated properly after timeout/retry?

> Within file osm_vl15intf.c, function vl15_poller checks the
> rfifo, and if the qlist still has items it calls vl15_send_mad,
> which later on triggers the signal.
> With the current default setting of 4 for OSM_DEFAULT_SMP_MAX_ON_WIRE I
> noticed that cl_qlist_end reaches zero before
> stats->qp0_mads_outstanding does. This causes a stall in
> cl_event_wait_on. The rfifo always reaches 0 when there are 4
> qp0_mads_outstanding; however, when it fails, it always fails when
> qp0_mads_outstanding is 1.

Is some (request) SMP that OpenSM sent timing out (not being responded to)?

> Have you seen this failure? By the way, I see this failure approximately
> once every 15 reboots.
>
> I discovered that changing OSM_DEFAULT_SMP_MAX_ON_WIRE to 1 fixes the
> problem.

What exactly do you mean by "fixes the problem"? I'm not sure I understand
what the problem is yet.

-- Hal

> My guess is that there is a race condition when the switch sends 4 SMPs
> in parallel. Also, this failure only appears to occur at reboot. Another
> solution, which is not acceptable, is adding a delay in the process;
> with the delay the failure goes away. It is as if the switch needed
> more time to do something.
>
> I would really appreciate your help and insight.
> Thank you
>
> Hector Abrach
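To make the "stuck vs. updated after timeout" question above concrete, here is a toy model of a throttled sender. It is only a sketch and is not OpenSM code: run_sweep, lost and timeouts_enabled are made-up names, max_on_wire merely plays the role of OSM_DEFAULT_SMP_MAX_ON_WIRE, and outstanding the role of qp0_mads_outstanding. It does not try to reproduce the reported race; it only shows how a single unanswered request leaves the counter at 1 unless something (a timeout handler, in this model) gives the slot back.

```python
from collections import deque


def run_sweep(num_requests, max_on_wire, lost, timeouts_enabled):
    """Toy model of a sender that keeps at most max_on_wire requests in flight.

    lost is a set of request ids that never get a response (an assumption
    made purely for illustration). Returns the final counters.
    """
    pending = deque(range(num_requests))  # requests not yet sent
    in_flight = deque()                   # sent, outcome not yet known
    outstanding = 0                       # stand-in for an "outstanding" counter
    completed = 0

    while True:
        # Poller step: send for as long as the throttle allows it.
        while pending and outstanding < max_on_wire:
            in_flight.append(pending.popleft())
            outstanding += 1

        if not in_flight:
            break  # nothing left that could ever change the counter

        # Transport step: resolve the oldest in-flight request.
        req = in_flight.popleft()
        if req not in lost:
            completed += 1
            outstanding -= 1  # a response arrived
        elif timeouts_enabled:
            outstanding -= 1  # the timeout path returns the slot
        # Otherwise the slot is never returned and the counter stays elevated.

    return {
        "completed": completed,
        "outstanding": outstanding,
        "unsent": len(pending),
        "stalled": outstanding > 0,  # a waiter for "outstanding == 0" would hang
    }


if __name__ == "__main__":
    print(run_sweep(10, 4, lost={7}, timeouts_enabled=False))
    print(run_sweep(10, 4, lost={7}, timeouts_enabled=True))
```

In this model, one lost request with a window of 4 lets everything else drain and leaves the counter at 1, which is why it matters whether the counter in the real run stays at 1 forever or drops to 0 once the transaction times out.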
_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg