Hmmm...Eugene, you need to be a tad less sensitive. Nobody was attempting to indict you or in any way attack you or your code.

What I was attempting to point out is that there are a number of sm failures during sm init. I didn't single you out. I posted it to the community because (a) it is a persistent problem, as you yourself note, that involves code from a number of people; (b) it is something users have reported; and (c) it is clearly a race condition, which means it will be very difficult to chase down.
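
To illustrate the general shape of the bug (this is a generic sketch of a shared-memory startup race, not the actual sm BTL code -- every name below is made up):

/* Generic illustration of a startup race over a shared segment.
 * NOT the actual sm BTL code -- just the shape of the bug. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    volatile int initialized;   /* creator sets this last              */
    int          num_fifos;     /* ...control data the peers depend on */
} seg_header_t;

int main(void)
{
    /* Stand-in for the mmap'ed backing file a shared-memory BTL
     * would set up; MAP_SHARED|MAP_ANONYMOUS survives the fork.      */
    seg_header_t *hdr = mmap(NULL, sizeof(*hdr), PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset((void *)hdr, 0, sizeof(*hdr));

    if (fork() == 0) {                   /* "peer" process             */
        /* BUG pattern: if the peer forgets (or mis-orders) this wait,
         * it reads num_fifos before the creator has written it -- or,
         * on a weakly ordered machine, sees initialized == 1 without
         * a barrier while the rest of the header is still garbage.   */
        while (!hdr->initialized) {
            ;                            /* real code would yield      */
        }
        __sync_synchronize();            /* read barrier               */
        printf("peer sees %d fifos\n", hdr->num_fifos);
        _exit(0);
    }

    /* "creator" process: fill in the header, THEN publish it.         */
    hdr->num_fifos = 16;
    __sync_synchronize();                /* write barrier              */
    hdr->initialized = 1;

    wait(NULL);
    return 0;
}

The nasty part is that it only bites when a peer happens to win the race, which is why this kind of thing shows up intermittently under MTT load and almost never on a developer's desktop.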

So please stop the defensive rhetoric - we are not about assigning blame, but rather about getting the code to work right.

Since we clearly see problems on sif, and Josh has indicated a willingness to help with debugging, this might be a place to start the investigation. If asked nicely, they might even be willing to grant access to the machine, if that would help.

Whether or not we fix this for 1.3.2 is a community decision. At some point, though, we are going to have to resolve this problem.

Thanks
Ralph

On Mar 26, 2009, at 11:39 PM, Eugene Loh wrote:

Ralph Castain wrote:

You are correct - the Sun errors are in a version prior to the insertion of the SM changes. We didn't relabel the version to 1.3.2 until -after- those changes went in, so you have to look for anything with an r number >= 20839.

The sif errors are all in that group - I would suggest starting there.

I suspect Josh or someone at IU could tell you the compiler. I would be very surprised if it wasn't gcc, but I don't know what version. I suspect they could even find a way to run some debugging for you, if that would help.

Okay, right now I'm not worried about compiler.

My attorneys advised me not to speak to the public, but I share with you this
prepared statement.  :^)

I don't claim my code is clean. Honestly, there was sm BTL code that worried me and I can't claim to have "done no worse" in the changes I made. But, this spate of test failures doesn't indict me. (Geez, sorry for being so defensive.
I guess I just worry myself!)

Let's start with the recent test results you indicated.  Say,
http://www.open-mpi.org/mtt/index.php?do_redir=973 which shows these failures:

143 on IU_Sif
 28 on Sun/Linux (row #6 at that URL, I guess, but you said 34?)
  3 on Sun/SunOS (row #7)

But, I guess we agreed that the Sun/Linux and Sun/SunOS failures are with 1.3.1,
and therefore are not attributable to single-queue changes.

So now we look at recent history for IU_Sif.  E.g.,
http://www.open-mpi.org/mtt/index.php?do_redir=975
Here is what I see:

 #  MPI name            MPI version       MPI install    Test build      Test run       pass:fail
                                          Pass   Fail    Pass   Fail     Pass    Fail   ratio
 1  ompi-nightly-trunk  1.4a1r20771          6      0      24      0    10585     11      962
 2  ompi-nightly-trunk  1.4a1r20777          6      0      24      0    11880     20      594
 3  ompi-nightly-trunk  1.4a1r20781         12      0      48      0    23759     95      250
 4  ompi-nightly-trunk  1.4a1r20793         12      0      48      0    23822     61      390
 5  ompi-nightly-trunk  1.4a1r20828          8      0      28      8    22893     51      448
 6  ompi-nightly-trunk  1.4a1r20834          6      0      20      4    11442     55      208
 7  ompi-nightly-trunk  1.4a1r20837         18      0      72      0    34084    157      217
 8  ompi-nightly-trunk  1.4a1r20859          2      0      12      0    11900     30      396
 9  ompi-nightly-trunk  1.4a1r20884          6      0      24      0    11843     59      200
10  ompi-nightly-v1.3   1.3.1rc5r20730      20      0      71      0    25108    252       99
11  ompi-nightly-v1.3   1.3.1rc5r20794       5      0      18      0     7332    112       65
12  ompi-nightly-v1.3   1.3.1rc5r20810       5      0      18      0     6813     75       90
13  ompi-nightly-v1.3   1.3.1rc5r20826      26      0      96      0    37205   3108       11
14  ompi-nightly-v1.3   1.3.2a1r20855        1      0       6      0      296    107        2
15  ompi-nightly-v1.3   1.3.2a1r20880       5      0      18      0     5825    143       40

I added that last "pass:fail ratio" column (test-run passes divided by test-run failures; e.g., 5825/143 is about 40 for row #15). The run you indicate (row #15) indeed has a dramatically low pass:fail ratio, but not *THAT* low. E.g., the first 1.3.1 run we see (row #10) is certainly of the same order of magnitude at 99.

We can compare those two revs in greater detail.  I see this:

r20730:
 # Suite    np Pass Fail
 1 ibm      16   0   32
 2 intel    16   0  123
 3 iu_ft_cr 16   0    3
 4 onesided 10   0   16
 5 onesided 12   0   32
 6 onesided 14   0   24
 7 onesided 16   0   22

r20880:
 # Suite    np Pass Fail
 1 ibm      16   0   27
 2 intel    16   0   38
 3 iu_ft_cr 16   0    2
 4 onesided  2   0   10
 5 onesided  4   0    9
 6 onesided  6   0    9
 7 onesided  8   0    9
 8 onesided 10   0    9
 9 onesided 12   0   10
10 onesided 14   0   10
11 onesided 16   0   10

To me, r20880 doesn't particularly look worse than r20730.

We can dig deeper into some of these results. I looked closely at the "ibm np=16" and "onesided np=16" results. Indeed, r20880 shows lots of seg faults in mca_btl_sm.so. On the other hand, they don't (so far as I can tell) arise in the add_procs stack; many aren't in MPI_Init at all, and some have to do with librdmacm. In any case, I seem to find very much the same stack traces for r20730.
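
For anyone who wants to poke at these crashes outside of MTT, one low-tech trick (a generic debugging aid, nothing OMPI-specific -- it just uses glibc's backtrace facility) is to dump a backtrace from a SIGSEGV handler in the test program and diff what you get under the two revs:

/* Minimal segv backtrace dumper -- a generic debugging aid, not part
 * of Open MPI.  Uses glibc's backtrace()/backtrace_symbols_fd().     */
#include <execinfo.h>
#include <signal.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[64];
    int   nframes = backtrace(frames, 64);

    /* backtrace_symbols_fd() writes straight to the fd and avoids
     * malloc (unlike backtrace_symbols), so it is reasonably safe
     * for a post-mortem dump inside a signal handler.                */
    backtrace_symbols_fd(frames, nframes, STDERR_FILENO);
    _exit(128 + sig);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);

    /* ...run the code under suspicion here (e.g., MPI_Init plus the
     * failing test case) and compare the traces between revs.        */
    return 0;
}

Linking the test program with -rdynamic helps the symbol names show up in the dumped frames.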

I'm still worried that my single-queue code either left a race condition in the sm BTL start-up or perhaps even made it worse. The recent MTT failures, however, don't seem to point to that. They seem to point to problems other than the intermittent segv's that Jeff and Terry were seeing, and the data do not, to me, indicate an increased failure frequency with 1.3.2.

Other opinions welcomed.