Hmmm...Eugene, you need to be a tad less sensitive. Nobody was
attempting to indict you or in any way attack you or your code.
What I was attempting to point out is that there are a number of sm
failures during sm init. I didn't single you out. I posted it to the
community because (a) it is a persistent problem, as you yourself
note, that involves code from a number of people; (b) it is something
users have reported; and (c) it is clearly a race condition, which
means it will be very difficult to chase down.
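To give a concrete picture of the kind of startup race I mean, here is a purely hypothetical sketch (NOT the actual sm BTL code, and the segment name is made up): one process creates and initializes a shared-memory segment while a peer attaches and reads from it before setup has finished.

/* Hypothetical sketch of a shared-memory startup race; NOT the actual
 * sm BTL code.  The "creator" sizes and initializes a segment, then
 * sets a ready flag; the "peer" maps the segment as soon as it exists
 * and may look at it before setup has finished.  (Link with -lrt on
 * Linux.) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_NAME "/sm_race_demo"   /* hypothetical segment name */

typedef struct {
    volatile int ready;            /* set by the creator when setup is done */
    char fifo_space[4096];         /* stand-in for per-peer FIFO structures */
} seg_ctrl_t;

int main(void)
{
    shm_unlink(SEG_NAME);          /* start from a clean slate */

    pid_t pid = fork();
    if (0 == pid) {                /* peer: attach and use the segment */
        int fd;
        while ((fd = shm_open(SEG_NAME, O_RDWR, 0600)) < 0) {
            ;                      /* spin until the segment appears */
        }
        seg_ctrl_t *ctrl = mmap(NULL, sizeof(*ctrl),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (MAP_FAILED == ctrl) {
            perror("peer mmap");
            _exit(1);
        }
        if (!ctrl->ready) {
            /* RACE: the segment exists but the creator has not finished
             * initializing it.  (If ftruncate() has not run yet, this
             * very read can SIGBUS instead.) */
            fprintf(stderr, "peer saw an uninitialized segment\n");
            _exit(1);
        }
        printf("peer attached cleanly\n");
        _exit(0);
    }

    /* creator: create, size, and initialize the segment */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(seg_ctrl_t));
    seg_ctrl_t *ctrl = mmap(NULL, sizeof(*ctrl),
                            PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    usleep(1000);                  /* widen the window so the race shows */
    memset(ctrl->fifo_space, 0, sizeof(ctrl->fifo_space));
    ctrl->ready = 1;               /* published only after setup completes */

    waitpid(pid, NULL, 0);
    shm_unlink(SEG_NAME);
    return 0;
}

The usual cure for this sort of thing is to publish the segment (or its name) only after initialization is complete, or to have the attaching side block on something that is guaranteed to be written last. Where exactly our code falls short is the part that will be hard to chase down.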
So please stop the defensive rhetoric - we are not about assigning
blame, but rather about getting the code to work right.
Since we clearly see problems on sif, and Josh has indicated a
willingness to help with debugging, this might be a place to start the
investigation. If asked nicely, they might even be willing to grant
access to the machine, if that would help.
Whether or not we fix this for 1.3.2 is a community decision. At some
point, though, we are going to have to resolve this problem.
Thanks
Ralph
On Mar 26, 2009, at 11:39 PM, Eugene Loh wrote:
Ralph Castain wrote:
You are correct - the Sun errors are in a version prior to the
insertion of the SM changes. We didn't relabel the version to
1.3.2 until -after- those changes went in, so you have to look for
anything with an r number >= 20839.
The sif errors are all in that group - I would suggest starting
there.
I suspect Josh or someone at IU could tell you the compiler. I would be very surprised if it wasn't gcc, but I don't know what version. I suspect they could even find a way to run some debugging for you, if that would help.
Okay, right now I'm not worried about the compiler.

My attorneys advised me not to speak to the public, but I share with you this prepared statement. :^)
I don't claim my code is clean. Honestly, there was sm BTL code that worried me, and I can't claim to have "done no worse" in the changes I made. But this spate of test failures doesn't indict me. (Geez, sorry for being so defensive. I guess I just worry myself!)
Let's start with the recent test results you indicated. Say, http://www.open-mpi.org/mtt/index.php?do_redir=973 which shows these failures:

  143 on IU_Sif
   28 on Sun/Linux (row #6 at that URL, I guess, but you said 34?)
    3 on Sun/SunOS (row #7)

But I guess we agreed that the Sun/Linux and Sun/SunOS failures are with 1.3.1, and therefore are not attributable to the single-queue changes.
So now we look at recent history for IU_Sif. E.g.,
http://www.open-mpi.org/mtt/index.php?do_redir=975
Here is what I see:
                                            MPI install   Test build      Test run      pass:fail
 #  MPI name            MPI version        Pass  Fail     Pass  Fail     Pass    Fail     ratio
 1  ompi-nightly-trunk  1.4a1r20771           6     0       24     0    10585      11       962
 2  ompi-nightly-trunk  1.4a1r20777           6     0       24     0    11880      20       594
 3  ompi-nightly-trunk  1.4a1r20781          12     0       48     0    23759      95       250
 4  ompi-nightly-trunk  1.4a1r20793          12     0       48     0    23822      61       390
 5  ompi-nightly-trunk  1.4a1r20828           8     0       28     8    22893      51       448
 6  ompi-nightly-trunk  1.4a1r20834           6     0       20     4    11442      55       208
 7  ompi-nightly-trunk  1.4a1r20837          18     0       72     0    34084     157       217
 8  ompi-nightly-trunk  1.4a1r20859           2     0       12     0    11900      30       396
 9  ompi-nightly-trunk  1.4a1r20884           6     0       24     0    11843      59       200
10  ompi-nightly-v1.3   1.3.1rc5r20730       20     0       71     0    25108     252        99
11  ompi-nightly-v1.3   1.3.1rc5r20794        5     0       18     0     7332     112        65
12  ompi-nightly-v1.3   1.3.1rc5r20810        5     0       18     0     6813      75        90
13  ompi-nightly-v1.3   1.3.1rc5r20826       26     0       96     0    37205    3108        11
14  ompi-nightly-v1.3   1.3.2a1r20855         1     0        6     0      296     107         2
15  ompi-nightly-v1.3   1.3.2a1r20880         5     0       18     0     5825     143        40
I added that last "pass:fail ratio" column. The run you indicate (row #15) indeed has a dramatically low pass:fail ratio, but not *THAT* low. E.g., the first 1.3.1 run we see (row #10) is certainly of the same order of magnitude.
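(For the record, that added column is just test-run passes divided by test-run failures, truncated. A tiny sketch for the two rows compared below:)

/* How the added "pass:fail ratio" column was computed: test-run passes
 * divided by test-run failures (integer truncation), shown for the two
 * rows compared below. */
#include <stdio.h>

int main(void)
{
    struct { const char *ver; int pass, fail; } rows[] = {
        { "1.3.1rc5r20730", 25108, 252 },  /* row #10 */
        { "1.3.2a1r20880",   5825, 143 },  /* row #15 */
    };

    for (int i = 0; i < 2; i++) {
        printf("%-16s pass:fail = %d/%d = %d\n", rows[i].ver,
               rows[i].pass, rows[i].fail, rows[i].pass / rows[i].fail);
    }
    return 0;
}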
We can compare those two revs in greater detail. I see this:
r20730:
 #  Suite     np  Pass  Fail
 1  ibm       16     0    32
 2  intel     16     0   123
 3  iu_ft_cr  16     0     3
 4  onesided  10     0    16
 5  onesided  12     0    32
 6  onesided  14     0    24
 7  onesided  16     0    22

r20880:
 #  Suite     np  Pass  Fail
 1  ibm       16     0    27
 2  intel     16     0    38
 3  iu_ft_cr  16     0     2
 4  onesided   2     0    10
 5  onesided   4     0     9
 6  onesided   6     0     9
 7  onesided   8     0     9
 8  onesided  10     0     9
 9  onesided  12     0    10
10  onesided  14     0    10
11  onesided  16     0    10
To me, r20880 doesn't particularly look worse than r20730.
We can dive deeper into some of these results. I looked at the "ibm np=16" and "onesided np=16" results a lot. Indeed, r20880 shows lots of seg faults in mca_btl_sm.so. On the other hand, they don't (so far as I can tell) arise in the add_procs stack. Indeed, many aren't in MPI_Init at all. Some have to do with librdmacm. In any case, I seem to find very much the same stack traces for r20730.
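(For anyone who wants to repeat that comparison: one quick way to capture where these intermittent segv's land, assuming the tests can be relinked, is to install a SIGSEGV handler that dumps the stack before exiting. This is just a generic glibc sketch, not something that exists in the test harness today:)

/* Generic sketch (glibc-specific, execinfo.h) for dumping a backtrace
 * when a test segfaults, so the failing stack (e.g. frames inside
 * mca_btl_sm.so) can be compared across revisions.  Link with
 * -rdynamic for more readable symbol names. */
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int nframes = backtrace(frames, 64);

    /* backtrace_symbols_fd() writes straight to the fd and avoids
     * malloc(), which matters inside a signal handler. */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    _exit(128 + sig);
}

/* Call this early in the test, e.g. right before MPI_Init(). */
void install_segv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);
}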
I'm still worried that my single-queue code either left a race condition in the sm BTL start-up or perhaps even made it worse. The recent MTT failures, however, don't seem to point to that. They seem to point to problems other than the intermittent segv's that Jeff and Terry were seeing, and the data does not seem to me to indicate an increased frequency with 1.3.2.
Other opinions welcomed.