Pasha and I *think* we have a fix. However, neither of us is entirely
clear on this part of the code, so the fix needs more testing and more
eyes on it.
I'll start the tests now -- given that this is a low-frequency bug,
I'm going to run a slightly larger MTT run (several thousand tests)
that'll take a few hours (not the ~12 hours that my MTT run took
yesterday) and see if we can get reasonable confidence that we fixed it.
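As a rough sanity check on how big that run needs to be, here is a minimal sketch. It assumes every test run fails independently at the openib rate observed yesterday (80 of 15962), which is a simplification; the helper names are illustrative and are not part of MTT.

```python
import math

# Observed openib failure rate from the ~16k-test MTT run (80 of 15962).
observed_failures = 80
total_runs = 15962
p = observed_failures / total_runs


def prob_all_pass(n, p):
    """Chance that n independent runs all pass if each fails with probability p."""
    return (1.0 - p) ** n


def runs_needed(p, confidence):
    """Smallest n at which an unfixed bug would show up at least once
    with the given probability (assumes independent runs)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for n in (700, 3000, 5000):
        print(f"{n} clean runs by pure chance: {prob_all_pass(n, p):.2e}")
    print("runs needed for 99.9% confidence:", runs_needed(p, 0.999))
```

Under those assumptions, a clean run of several thousand tests would be very unlikely if the bug were still present, which is why a run of that size is a reasonable target.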
On Jan 15, 2009, at 9:05 AM, Jeff Squyres wrote:
Unfortunately, I have to throw the flag on the v1.3 release. :-(
I ran ~16k tests via MTT yesterday on the rc5 and rc6 tarballs. I
found the following:
Found test runs: 15962
Passed: 15785 (98.89%)
Failed: 83 (0.52%)
    --> Openib failures: 80 (0.50%)
Skipped: 46 (0.29%)
Timedout: 48 (0.30%)
The 80 openib failures are all seemingly random segvs. I repeated
a much smaller run this morning (about 700 runs) and still found a
non-zero percentage of failures of the same flavor.
The timeouts are a little worrisome as well.
This unfortunately requires investigation. :-(
--
Jeff Squyres
Cisco Systems
--
Jeff Squyres
Cisco Systems