On Thu, Feb 4, 2010 at 7:13 PM, Ira Weiny <wei...@llnl.gov> wrote:
> On Thu, 4 Feb 2010 15:01:32 -0500
> Hal Rosenstock <hal.rosenst...@gmail.com> wrote:
>
>> On Thu, Feb 4, 2010 at 1:00 PM, Ira Weiny <wei...@llnl.gov> wrote:
>> > On Thu, 4 Feb 2010 09:19:39 -0500
>> > Hal Rosenstock <hal.rosenst...@gmail.com> wrote:
>> >
>> >> On Tue, Feb 2, 2010 at 7:45 PM, Ira Weiny <wei...@llnl.gov> wrote:
>> >> > Sasha,
>> >> >
>
> [snip]
>
>> >> >
>> >> > real 0m2.249s
>> >> > user 0m1.244s
>> >> > sys 0m0.936s
>> >> >
>> >> > 14:40:59 > time ./ibnetdiscover -o 4 --node-name-map /etc/opensm/ib-node-name-map -g > new
>> >> >
>> >> > real 0m2.170s
>> >> > user 0m1.160s
>> >> > sys 0m0.933s
>> >> >
>> >> > 14:41:10 > /usr/sbin/ibqueryerrors -s RcvErrors,SymbolErrors,RcvSwRelayErrors,XmtWait -r --data
>> >> > Suppressing: RcvErrors SymbolErrors RcvSwRelayErrors XmtWait
>> >> > Errors for 0x66a00d90006fb "SW19"
>> >> > GUID 0x66a00d90006fb port 9: [VL15Dropped == 3] [XmtData == 25187379] [RcvData == 25196688] [XmtPkts == 349861] [RcvPkts == 349954]
>> >> > Link info: 139 9[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 0x0002c9030001d736 864 1[ ] "hyperion1" ( )
>> >> >
>> >> > Note that there were no additional VL15Dropped packets on the fabric.
>> >> > I think 4 seems to be a good compromise. I have not tested when there
>> >> > are errors on the fabric. (Right now things seem to be good!)
>> >>
>> >> Is this just with the SM doing light sweeping ?
>> >
>> > Yes.
>>
>> That's not a lot of SMP stress from the SM side. SMP consumers are SM,
>> diags, and the unsolicited traps.
>
> Agreed. I hope to test this more next week.
>>
>> >
>> >>
>> >> Is there a speedup with 4 rather than 2 ?
>> >
>> > There is a bit of a speed up (~0.5 to 1.0 sec). But my main reason to want to
>> > go to 4 is that if there are issues on the fabric, unresponsive nodes etc.; 4
>> > will give us better parallelism to get around these issues.
>> > I have not had the chance to test this condition with the new algorithm but
>> > the original ibnetdiscover would slow way down when there are nodes which have
>> > unresponsive SMA's. If there are only 2 outstanding this will not give us much
>> > speed up. This was the main motivation I had for improving the library in this
>> > way.
>> >
>> > Also, I think you are correct that we should increase OpenSM's default from 4
>> > to 8. For the same reason as above. Some of our clusters have worked better
>> > with 8 when we are having issues. But right now we are still running with 4.
>>
>> I'm concerned about just increasing ibnetdiscover to 4 rather than 2.
>> I've seen a number of clusters with SMP dropping with the current
>> lower defaults.
>
> So OpenSM is seeing dropped packets?

OpenSM is seeing timeouts and there are VL15 drops in the subnet.

> With 4 SMP's on the wire?

Yes.

> I do see some
> VL15Dropped errors (maybe 2-3 a day) but I did not think that would be an
> issue. What kind of rate are you seeing?
>
> The other question is; do people regularly run the tools which are using
> libibnetdisc (ibqueryerrors, iblinkinfo, ibnetdiscover)?

These tools are being used (at least ibnetdiscover and ibqueryerrors).

> We do. If others
> are not then I would say this change would have less impact as they would want
> the diags to have some priority for debugging. The other option is to change
> the patch to be a default of 2 and allow user to change it depending on what
> they are trying to do. If you think that is best I will change the patch.

FWIW I think 2 is better until we have more exhaustive experience with
4. The other alternative would be to make it 4 and then see if people
start noticing (more) VL15 drops and possibly other issues.

-- Hal

> Ira
>
>>
>> -- Hal
>>
>> > Ira
>> >
>> >>
>> >> -- Hal
>> >>
>> >> >
>> >> > The first patch converts the algorithm and the second adds the
>> >> > ibnd_set_max_smps_on_wire call.
>> >> >
>> >> > Let me know what you think. Because the algorithm changed so much
>> >> > testing this is a bit difficult because the order of the node discovery
>> >> > is different. However, I have done some extensive diffing of the
>> >> > output of ibnetdiscover and things look good.
>> >> >
>> >> > Ira
>> >> >
>> >> > --
>> >> > Ira Weiny
>> >> > Math Programmer/Computer Scientist
>> >> > Lawrence Livermore National Lab
>> >> > 925-423-8008
>> >> > wei...@llnl.gov
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html