Eugene,
I think I remember setting up the MTT tests on Sif so that tests
are run both with and without the coll_hierarch component selected.
The coll_hierarch component stresses code paths and potential
race conditions in its own way.  So, if the problems are showing up
more frequently for the test runs with the coll_hierarch component
enabled, then I would check the communicator creation code paths.

Now that I'm at SiCortex, I don't have time to look into these IU MTT
failures (not that I had a bunch of time while at IU ;-), but you can get
to a lot of information with some work in the MTT reporter web page.
Also, hopefully Josh will have a little time to look into it.

Good luck!    -- Tim

On Fri, Mar 27, 2009 at 10:15 AM, Eugene Loh <eugene....@sun.com> wrote:
> Josh Hursey wrote:
>
>> Sif is also running the coll_hierarch component on some of those  tests
>> which has caused some additional problems. I don't know if that  is related
>> or not.
>
> Indeed.  Many of the MTT stack traces (for both 1.3.1 and 1.3.2 and that
> have seg faults and call out mca_btl_sm.so) do involve collectives and/or
> have mca_coll_hierarch.so in their stack traces.  I could well imagine this
> is the culprit, though I do not know for sure.
>
> Ralph Castain wrote:
>
>> Hmmm...Eugene, you need to be a tad less sensitive. Nobody was  attempting
>> to indict you or in any way attack you or your code.
>
> Yes, I know, though thank you for saying so.  I was overdoing the defensive
> rhetoric trying to be funny, but I confess it's nervous humor.  There was
> stuff in the sm code that I couldn't see how it was 100% robust.
>  Nevertheless, I let that style remain in the code with my changes...
> perhaps even pushing it a little bit.  My putbacks include a comment or two
> to that effect.  E.g.,
> https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/btl/sm/btl_sm.c?r=20774#523
> .  I understand why the occasional failures that Jeff and Terry saw did not
> hold up 1.3.1, but I'd really like to understand them and fix them.  But at
> 0.01% fail rate (<0.001% for me... I've never seen it in 100Ks of runs), all
> I can do about etiology and fixes is speculate.
>
> Okay, joke overdone and nervousness no longer funny.  Indeed, annoying.  I
> stop.
>
>> Since we clearly see problems on sif, and Josh has indicated a
>>  willingness to help with debugging, this might be a place to start the
>>  investigation. If asked nicely, they might even be willing to grant  access
>> to the machine, if that would help.
>
> Maybe a starting point would be running IU_Sif without coll_hierarch and
> seeing where we stand.
>
> And, again, my gut feel is that the failures are unrelated to the 0.01%
> failures that Jeff and Terry were seeing.
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 timat...@sicortex.com || timat...@open-mpi.org
    I'm a bright... http://www.the-brights.net/
