We talked about this a lot today on the call (and then some more afterwards).
:-)
I think there are two important points here.
1. Ralph's original test was written with the intent of launching it with one
process, which would then do a series of local spawns. Even doing a huge
truckload of them, Ralph mentioned (on the phone with me today) that it only
took about 15 seconds.
2. My test -- i.e., the current one in the ibm test suite directory -- is more
of a general "beat on the ORTE/spawn system" test. I.e., just spawn/reap a
bajillion times and ensure that it works. I think that it still breaks openib,
for example -- after you do a bunch of spawns, something runs out of resources
(I don't remember the exact failure scenario).
----
Ralph's opinion is that we don't need to test for #1 any more. I don't think
it would be bad to keep testing for #1, but the C code for such a test could
be a bit smarter (i.e., only MCW rank 0 would COMM_SPAWN on COMM_SELF, using a
"host" info key of "localhost" to ensure that it spawns locally, while all
other MCW procs would idle in a while (!done) { sleep(1); MPI_Test(..., &done); }
loop so that they don't spin the CPU). A rough sketch is below.
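For the record, here's roughly what I mean -- just an untested sketch typed
into mail, where the child executable name ("./spawn_child") and the spawn
count are placeholders:

#include <unistd.h>
#include <mpi.h>

#define NUM_SPAWNS 100                 /* placeholder count */

int main(int argc, char **argv)
{
    int rank, size, i, done = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (0 == rank) {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "localhost");   /* keep the children local */
        for (i = 0; i < NUM_SPAWNS; ++i) {
            MPI_Comm child;
            MPI_Comm_spawn("./spawn_child", MPI_ARGV_NULL, 1, info, 0,
                           MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&child);
        }
        MPI_Info_free(&info);
        /* wake up the idlers */
        for (i = 1; i < size; ++i) {
            MPI_Send(NULL, 0, MPI_BYTE, i, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Request req;
        MPI_Irecv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
        while (!done) {                /* idle without spinning the CPU */
            sleep(1);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}

The child program only needs to MPI_Init, MPI_Comm_get_parent,
MPI_Comm_disconnect from the parent, and MPI_Finalize.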
For #2, I don't disagree that Eugene's suggestions could make it a bit more
robust. After all, we only have so many hours of testing time and so much
equipment; one test that runs for hours and hours probably isn't useful. You
can imagine a bunch of ways to make that test more useful (a rough sketch is
below):

- take an argv[1] specifying the number of iterations
- or take an argv[1] indicating the number of seconds to run the test
- ensure that you only spawn from half the MCW processes and have the other
  half idle in a while (!done) { ... } loop, as mentioned above, so that you
  spawn onto CPUs that aren't spinning tightly on MPI progress
- ...etc.
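To make that concrete, here's another rough sketch (again, not the actual
ibm/dynamic code; "./spawn_child" is the same hypothetical child as above).
argv[1] gives the iteration count, the lower half of MCW does the spawning
collectively over a split communicator, and the upper half sleeps in an
MPI_Test loop until MCW rank 0 says we're done:

#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, done = 0, spawner, nspawn, iters;
    MPI_Comm half;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    iters = (argc > 1) ? atoi(argv[1]) : 10;       /* default is arbitrary */

    /* lower half of MCW spawns; upper half idles (everyone spawns if np == 1) */
    spawner = (rank < size / 2) || (1 == size);
    nspawn = (size / 2 > 0) ? size / 2 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, spawner, rank, &half);

    if (spawner) {
        for (i = 0; i < iters; ++i) {
            MPI_Comm child;
            /* collective over the spawning half only */
            MPI_Comm_spawn("./spawn_child", MPI_ARGV_NULL, nspawn,
                           MPI_INFO_NULL, 0, half, &child,
                           MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&child);
        }
        if (0 == rank) {
            /* tell the idle half that we're done */
            for (i = nspawn; i < size; ++i) {
                MPI_Send(NULL, 0, MPI_BYTE, i, 0, MPI_COMM_WORLD);
            }
        }
    } else {
        MPI_Request req;
        MPI_Irecv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
        while (!done) {                /* sleep instead of spinning on MPI progress */
            sleep(1);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }

    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}

A time-limit variant would be the same spawning loop, but checking MPI_Wtime()
against an argv[1] number of seconds instead of counting iterations.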
On Aug 15, 2011, at 11:47 AM, Eugene Loh wrote:
> This is a question about ompi-tests/ibm/dynamic. Some of these tests (spawn,
> spawn_multiple, loop_spawn/child, and no-disconnect) exercise MPI_Comm_spawn*
> functionality. Specifically, they spawn additional processes (beyond the
> initial mpirun launch) and therefore exert a different load on a test system
> than one might naively expect from the "mpirun -np <np>" command line.
>
> One approach to testing is to have the test harness know characteristics
> about individual tests like this. E.g., if I have only 8 processors and I
> don't want to oversubscribe, have the test harness know that particular tests
> should be launched with fewer processes. On the other hand, building such
> generality into a test harness when changes would have to be so pervasive
> (subjective assessment) and so few tests require it may not make that much
> sense.
>
> Another approach would be to manage oversubscription in the tests themselves.
> E.g., for spawn.c, instead of spawning np new processes, do the following:
>
> - idle np/2 of the processes
> - have the remaining np/2 processes spawn np/2 new ones
>
> (Okay, so that leaves open the possibility that the newly spawned processes
> might not appear on the same nodes where idled processes have "made room" for
> them. Each solution seems loaded with shortcomings.)
>
> Anyhow, I was interested in some feedback on this topic. A very small number
> (1-4) of spawning tests are causing us lots of problems (undue complexity in
> the test harness as well as a bunch of our time for reasons I find difficult
> to explain succinctly). We're inclined to modify the tests so that they're a
> little more social. E.g., make decisions about how many of the launched
> processes should "really" be used, idling some fraction of the processes, and
> continuing the test only with the remaining fraction.
>
> Comments?
--
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/