Eugene, I think I remember setting up the MTT tests on Sif so that tests are run both with and without the coll_hierarch component selected. The coll_hierarch component stresses code paths and potential race conditions in its own way. So, if the problems are showing up more frequently for the test runs with the coll_hierarch component enabled, then I would check the communicator creation code paths.
Now that I'm at SiCortex, I don't have time to look into these IU MTT failures (not that I had a bunch of time while at IU ;-), but you can get to a lot of information with some work in the MTT reporter web page. Also, hopefully Josh will have a little time to look into it.

Good luck!
-- Tim

On Fri, Mar 27, 2009 at 10:15 AM, Eugene Loh <eugene....@sun.com> wrote:
> Josh Hursey wrote:
>
>> Sif is also running the coll_hierarch component on some of those tests
>> which has caused some additional problems. I don't know if that is related
>> or not.
>
> Indeed. Many of the MTT stack traces (for both 1.3.1 and 1.3.2 and that
> have seg faults and call out mca_btl_sm.so) do involve collectives and/or
> have mca_coll_hierarch.so in their stack traces. I could well imagine this
> is the culprit, though I do not know for sure.
>
> Ralph Castain wrote:
>
>> Hmmm...Eugene, you need to be a tad less sensitive. Nobody was attempting
>> to indict you or in any way attack you or your code.
>
> Yes, I know, though thank you for saying so. I was overdoing the defensive
> rhetoric trying to be funny, but I confess it's nervous humor. There was
> stuff in the sm code that I couldn't see how it was 100% robust.
> Nevertheless, I let that style remain in the code with my changes...
> perhaps even pushing it a little bit. My putbacks include a comment or two
> to that effect. E.g.,
> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm.c?r=20774#523
> . I understand why the occasional failures that Jeff and Terry saw did not
> hold up 1.3.1, but I'd really like to understand them and fix them. But at
> 0.01% fail rate (<0.001% for me... I've never seen it in 100Ks of runs), all
> I can do about etiology and fixes is speculate.
>
> Okay, joke overdone and nervousness no longer funny. Indeed, annoying. I
> stop.
>
>> Since we clearly see problems on sif, and Josh has indicated a
>> willingness to help with debugging, this might be a place to start the
>> investigation. If asked nicely, they might even be willing to grant access
>> to the machine, if that would help.
>
> Maybe a starting point would be running IU_Sif without coll_hierarch and
> seeing where we stand.
>
> And, again, my gut feel is that the failures are unrelated to the 0.01%
> failures that Jeff and Terry were seeing.

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
timat...@sicortex.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/