We talked about this a lot today on the call (and then some more afterwards). 
:-)

I think there are two important points here.

1. Ralph's original test was written with the intent of launching it with 1 
process which would then do a series of local spawns.  Even doing a huge 
truckload of them, Ralph mentioned (on the phone to me today) that it only took 
about 15 seconds.

2. My test -- i.e., the current one in the ibm test suite directory -- is more 
of a general "beat on the ORTE/spawn system" test.  I.e., just spawn/reap a 
bajillion times and ensure that it works.  I think that it still breaks openib, 
for example -- after you do a bunch of spawns, something runs out of resources 
(I don't remember the exact failure scenario).

----

Ralph's opinion is that we don't need to test for #1 any more.  I don't think 
it would hurt to keep testing for #1, but the C code for such a test could be 
a bit smarter: only MPI_COMM_WORLD (MCW) rank 0 would call MPI_Comm_spawn on 
MPI_COMM_SELF, setting the "host" info key to "localhost" to ensure that it 
spawns locally, while any other MCW procs would idle in a loop such as 
while (!done) { sleep(1); MPI_Test(..., &done, ...); } so that they don't spin 
the CPU.
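
A rough sketch of what that smarter #1 test might look like is below.  It is 
not the code in the ibm test suite: NUM_SPAWNS and the child binary name 
"./spawn_child" are placeholders I made up, and it assumes the child calls 
MPI_Init, MPI_Comm_get_parent, MPI_Comm_disconnect, and MPI_Finalize.  The 
zero-byte "done" message from rank 0 is just one way to give the idle ranks 
something to MPI_Test on.

```c
#include <mpi.h>
#include <unistd.h>

#define NUM_SPAWNS 100              /* placeholder iteration count */
#define CHILD_CMD  "./spawn_child"  /* hypothetical child executable */

int main(int argc, char *argv[])
{
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        MPI_Info info;
        MPI_Comm child;

        /* "host" is a reserved info key for MPI_Comm_spawn */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "localhost");

        for (i = 0; i < NUM_SPAWNS; ++i) {
            /* Spawn one child over MPI_COMM_SELF, then reap it */
            MPI_Comm_spawn(CHILD_CMD, MPI_ARGV_NULL, 1, info, 0,
                           MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&child);
        }
        MPI_Info_free(&info);

        /* Tell the idle ranks that the spawning is finished */
        for (i = 1; i < size; ++i) {
            MPI_Send(NULL, 0, MPI_BYTE, i, 0, MPI_COMM_WORLD);
        }
    } else {
        MPI_Request req;
        int done = 0;

        /* Wait for rank 0 without spinning the CPU */
        MPI_Irecv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
        while (!done) {
            sleep(1);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```

The sleep(1) granularity means the idle ranks may notice completion up to a 
second late, which seems like a fair trade for leaving the CPUs free.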

For #2, I don't disagree that Eugene's suggestions could make it a bit more 
robust.  After all, we only have so many hours for testing with so much 
equipment; one test that runs for hours and hours probably isn't useful.  You 
can imagine a bunch of ways to make that test more useful: take an argv[1] 
specifying the number of iterations, or an argv[1] giving the number of 
seconds to run the test; spawn from only half of the MCW processes and have 
the other half idle in a while (!done) {...} loop like the one mentioned 
above, so that the spawned processes can run on CPUs that aren't spinning 
tightly on MPI progress; etc.
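
To make the iteration/time limit idea concrete, here is a small sketch of the 
bookkeeping only, with no MPI calls in it.  The helper names parse_limit and 
keep_spawning are invented for illustration; nothing like them exists in the 
current test.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parse a positive limit from a command-line argument.  Returns
   fallback if arg is NULL, empty, not a number, or not positive. */
static long parse_limit(const char *arg, long fallback)
{
    char *end = NULL;
    long value;

    if (arg == NULL || *arg == '\0') {
        return fallback;
    }
    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0) {
        return fallback;
    }
    return value;
}

/* Returns nonzero while the spawn loop should continue.  A limit
   that is <= 0 means "no limit of that kind". */
static int keep_spawning(long iter, long max_iters,
                         time_t start, time_t now, long max_seconds)
{
    if (max_iters > 0 && iter >= max_iters) {
        return 0;
    }
    if (max_seconds > 0 && difftime(now, start) >= (double) max_seconds) {
        return 0;
    }
    return 1;
}
```

The spawn loop would then look something like 
for (i = 0; keep_spawning(i, max_iters, start, time(NULL), max_secs); ++i), 
with max_iters or max_secs coming from parse_limit(argc > 1 ? argv[1] : NULL, 
some_default).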



On Aug 15, 2011, at 11:47 AM, Eugene Loh wrote:

> This is a question about ompi-tests/ibm/dynamic.  Some of these tests (spawn, 
> spawn_multiple, loop_spawn/child, and no-disconnect) exercise MPI_Comm_spawn* 
> functionality.  Specifically, they spawn additional processes (beyond the 
> initial mpirun launch) and therefore exert a different load on a test system 
> than one might naively expect from the "mpirun -np <np>" command line.
> 
> One approach to testing is to have the test harness know characteristics 
> about individual tests like this.  E.g., if I have only 8 processors and I 
> don't want to oversubscribe, have the test harness know that particular tests 
> should be launched with fewer processes.  On the other hand, building such 
> generality into a test harness when changes would have to be so pervasive 
> (subjective assessment) and so few tests require it may not make that much 
> sense.
> 
> Another approach would be to manage oversubscription in the tests themselves. 
>  E.g., for spawn.c, instead of spawning np new processes, do the following:
> 
> - idle np/2 of the processes
> - have the remaining np/2 processes spawn np/2 new ones
> 
> (Okay, so that leaves open the possibility that the newly spawned processes 
> might not appear on the same nodes where idled processes have "made room" for 
> them.  Each solution seems loaded with shortcomings.)
> 
> Anyhow, I was interested in some feedback on this topic.  A very small number 
> (1-4) of spawning tests are causing us lots of problems (undue complexity in 
> the test harness as well as a bunch of our time for reasons I find difficult 
> to explain succinctly).  We're inclined to modify the tests so that they're a 
> little more social.  E.g., make decisions about how many of the launched 
> processes should "really" be used, idling some fraction of the processes, and 
> continuing the test only with the remaining fraction.
> 
> Comments?
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

