FWIW, when I was looking into this before, the problem was definitely
during MPI_INIT. I ran out of time before being able to track it down
further, but it was definitely something during the sm startup --
during add_procs, IIRC.
It *looked* like there was some kind of bogus value in the bootstrap
shared memory segment, but I was having great difficulty tracking it
down because the corefiles that are left when the job segv's do
*not* include the shared memory segment. The problem occurred on a
line in the source code where we were accessing values in the
bootstrap shared memory segment, and that segment isn't in the
corefile. So you can't tell exactly what went wrong. :-\
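If anyone wants to chase this again: on Linux, file-backed shared
mappings are excluded from core dumps by default, which is presumably
why the bootstrap segment never shows up in the corefiles. A possible
workaround (just a sketch, assuming Linux with /proc/<pid>/coredump_filter;
the helper name is made up and this is not something Open MPI does
today) is to widen the coredump filter before the crash:

    /* Sketch: ask the kernel to include file-backed shared mappings
     * (such as an mmap'd sm backing file) in any core dump this
     * process produces.  0x7f enables all mapping-type bits,
     * including bit 3 (file-backed shared).  Assumes Linux. */
    #include <stdio.h>

    static void widen_coredump_filter(void)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        if (NULL != f) {
            fprintf(f, "0x7f");
            fclose(f);
        }
    }

Calling something like that early in each MPI process (or echoing the
value into /proc/<pid>/coredump_filter of the running procs) should
make the segment contents visible in the core.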
On Mar 27, 2009, at 5:05 AM, Ralph Castain wrote:
Hmmm...Eugene, you need to be a tad less sensitive. Nobody was
attempting to indict you or in any way attack you or your code.
What I was attempting to point out is that there are a number of sm
failures during sm init. I didn't single you out. I posted it to the
community because (a) it is a persistent problem, as you yourself
note, that involves code from a number of people; (b) it is something
users have reported; and (c) it is clearly a race condition, which
means it will be very difficult to chase down.
So please stop the defensive rhetoric - we are not about assigning
blame, but rather about getting the code to work right.
Since we clearly see problems on sif, and Josh has indicated a
willingness to help with debugging, this might be a place to start the
investigation. If asked nicely, they might even be willing to grant
access to the machine, if that would help.
Whether or not we fix this for 1.3.2 is a community decision. At some
point, though, we are going to have to resolve this problem.
Thanks
Ralph
On Mar 26, 2009, at 11:39 PM, Eugene Loh wrote:
> Ralph Castain wrote:
>
>> You are correct - the Sun errors are in a version prior to the
>> insertion of the SM changes. We didn't relabel the version to
>> 1.3.2 until -after- those changes went in, so you have to look for
>> anything with an r number >= 20839.
>>
>> The sif errors are all in that group - I would suggest starting
>> there.
>>
>> I suspect Josh or someone at IU could tell you the compiler. I
>> would be very surprised if it wasn't gcc, but I don't know what
>> version. I suspect they could even find a way to run some
>> debugging for you, if that would help.
>
> Okay, right now I'm not worried about the compiler.
>
> My attorneys advised me not to speak to the public, but I share with
> you this prepared statement. :^)
>
> I don't claim my code is clean. Honestly, there was sm BTL code that
> worried me, and I can't claim to have "done no worse" in the changes
> I made. But this spate of test failures doesn't indict me. (Geez,
> sorry for being so defensive. I guess I just worry myself!)
>
> Let's start with the recent test results you indicated. Say,
> http://www.open-mpi.org/mtt/index.php?do_redir=973 which shows these
> failures:
>
> 143 on IU_Sif
> 28 on Sun/Linux (row #6 at that URL, I guess, but you said 34?)
> 3 on Sun/SunOS (row #7)
>
> But, I guess we agreed that the Sun/Linux and Sun/SunOS failures are
> with 1.3.1,
> and therefore are not attributable to single-queue changes.
>
> So now we look at recent history for IU_Sif. E.g.,
> http://www.open-mpi.org/mtt/index.php?do_redir=975
> Here is what I see:
>
>  #  MPI name            MPI version     MPI install  Test build   Test run       pass:fail
>                                         Pass  Fail   Pass  Fail     Pass   Fail  ratio
>  1  ompi-nightly-trunk  1.4a1r20771        6     0    24     0   10585     11    962
>  2  ompi-nightly-trunk  1.4a1r20777        6     0    24     0   11880     20    594
>  3  ompi-nightly-trunk  1.4a1r20781       12     0    48     0   23759     95    250
>  4  ompi-nightly-trunk  1.4a1r20793       12     0    48     0   23822     61    390
>  5  ompi-nightly-trunk  1.4a1r20828        8     0    28     8   22893     51    448
>  6  ompi-nightly-trunk  1.4a1r20834        6     0    20     4   11442     55    208
>  7  ompi-nightly-trunk  1.4a1r20837       18     0    72     0   34084    157    217
>  8  ompi-nightly-trunk  1.4a1r20859        2     0    12     0   11900     30    396
>  9  ompi-nightly-trunk  1.4a1r20884        6     0    24     0   11843     59    200
> 10  ompi-nightly-v1.3   1.3.1rc5r20730    20     0    71     0   25108    252     99
> 11  ompi-nightly-v1.3   1.3.1rc5r20794     5     0    18     0    7332    112     65
> 12  ompi-nightly-v1.3   1.3.1rc5r20810     5     0    18     0    6813     75     90
> 13  ompi-nightly-v1.3   1.3.1rc5r20826    26     0    96     0   37205   3108     11
> 14  ompi-nightly-v1.3   1.3.2a1r20855      1     0     6     0     296    107      2
> 15  ompi-nightly-v1.3   1.3.2a1r20880      5     0    18     0    5825    143     40
>
> I added that last "pass:fail ratio" column. The run you indicate
> (row #15) indeed has a dramatically low pass:fail ratio, but not
> *THAT* low. E.g., the first 1.3.1 run we see (row #10) is certainly
> of the same order of magnitude.
>
> We can compare those two revs in greater detail. I see this:
>
> r20730:
>  #  Suite     np  Pass  Fail
>  1  ibm       16     0    32
>  2  intel     16     0   123
>  3  iu_ft_cr  16     0     3
>  4  onesided  10     0    16
>  5  onesided  12     0    32
>  6  onesided  14     0    24
>  7  onesided  16     0    22
>
> r20880:
>  #  Suite     np  Pass  Fail
>  1  ibm       16     0    27
>  2  intel     16     0    38
>  3  iu_ft_cr  16     0     2
>  4  onesided   2     0    10
>  5  onesided   4     0     9
>  6  onesided   6     0     9
>  7  onesided   8     0     9
>  8  onesided  10     0     9
>  9  onesided  12     0    10
> 10  onesided  14     0    10
> 11  onesided  16     0    10
>
> To me, r20880 doesn't particularly look worse than r20730.
>
> We can deep dive on some of these results. I looked at the "ibm
> np=16" and "onesided np=16" results a lot. Indeed, r20880 shows lots
> of seg faults in mca_btl_sm.so. On the other hand, they don't (so far
> as I can tell) arise in the add_procs stack. Indeed, many aren't in
> MPI_Init at all. Some have to do with librdmacm. In any case, I seem
> to find very much the same stack traces for r20730.
>
> I'm still worried that my single-queue code either left a
> pre-existing race condition in the sm BTL start-up or perhaps even
> made it worse. The recent MTT failures, however, don't seem to point
> to that. They seem to point to problems other than the intermittent
> segv's that Jeff and Terry were seeing, and the data does not seem to
> me to indicate an increased failure frequency with 1.3.2.
>
> Other opinions welcomed.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems