We talked about this a lot today on the call (and then some more afterwards). :-)

I think there are two important points here:

1. Ralph's original test was written with the intent of launching it with 1 process, which would then do a series of local spawns. Even doing a huge truckload of them, Ralph mentioned (on the phone to me today) that it only took about 15 seconds.

2. My test -- i.e., the current one in the ibm test suite directory -- is more of a general "beat on the ORTE/spawn system" test. I.e., just spawn/reap a bajillion times and ensure that it works. I think that it still breaks openib, for example -- after you do a bunch of spawns, something runs out of resources (I don't remember the exact failure scenario).

----

Ralph's opinion is that we don't need to test for #1 any more. I don't think it would hurt to keep testing #1, but the C code for such a test could be a bit smarter: only MCW rank 0 would COMM_SPAWN on COMM_SELF, using a host info key of "localhost" to ensure spawning locally, while all other MCW procs idle in a while (!done) { sleep(1); MPI_Test(..., &done); } loop so that they don't spin the CPU. See the first sketch below.

For #2, I don't disagree that Eugene's suggestions could make it a bit more robust. After all, we only have so many hours for testing and only so much equipment; one test that runs for hours and hours probably isn't useful. You can imagine a bunch of ways to make that test more useful: take an argv[1] specifying the number of iterations, or an argv[1] that indicates the number of seconds to run the test; spawn on only half of the MCW processes and have the other half idle in a while (!done) { ... } loop like the one mentioned above, so that the children land on CPUs that aren't spinning tightly on MPI progress; ...etc. The second sketch below shows one way that could look.
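Something like this is roughly what I have in mind for #1 -- an untested sketch, not the actual ibm/dynamic test. The child executable name, the spawn count, and the zero-byte "release" message / tag 0 are all placeholders; the (hypothetical) child would need to MPI_Comm_get_parent() and MPI_Comm_disconnect() on its side:

    /* spawn_local.c: hypothetical sketch, not the actual ibm/dynamic test. */
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (0 == rank) {
            /* Only MCW rank 0 spawns, on COMM_SELF; the "host" info key
               keeps the children on the local node. */
            MPI_Info info;
            int num_spawns = 100;               /* placeholder count */

            MPI_Info_create(&info);
            MPI_Info_set(info, "host", "localhost");
            for (i = 0; i < num_spawns; ++i) {
                MPI_Comm child;
                MPI_Comm_spawn("./spawn_local_child", MPI_ARGV_NULL, 1,
                               info, 0, MPI_COMM_SELF, &child,
                               MPI_ERRCODES_IGNORE);
                /* The child is assumed to MPI_Comm_get_parent() and
                   disconnect on its side. */
                MPI_Comm_disconnect(&child);
            }
            MPI_Info_free(&info);

            /* Release the idle ranks. */
            for (i = 1; i < size; ++i) {
                MPI_Send(NULL, 0, MPI_BYTE, i, 0, MPI_COMM_WORLD);
            }
        } else {
            /* Everyone else idles without spinning the CPU: sleep,
               then poke the MPI progress engine once a second. */
            int done = 0;
            MPI_Request req;

            MPI_Irecv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            while (!done) {
                sleep(1);
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            }
        }

        MPI_Finalize();
        return 0;
    }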
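And an equally rough, untested sketch of the #2-style stress test with an argv[1] iteration count and half of MCW idling (it shows the iteration-count variant only, not the run-for-N-seconds variant; the child executable name and the default iteration count are placeholders, and it assumes at least 2 procs in MCW):

    /* spawn_stress.c: hypothetical sketch; assumes >= 2 procs in MCW. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, half_np, active, i;
        int iters = 10;                         /* placeholder default */
        MPI_Comm half;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (argc > 1) {
            iters = atoi(argv[1]);
        }

        /* Lower half of MCW spawns; upper half just idles so the children
           land on CPUs that are not spinning on MPI progress. */
        half_np = size / 2;
        active = (rank < half_np);
        MPI_Comm_split(MPI_COMM_WORLD, active, rank, &half);

        if (active) {
            for (i = 0; i < iters; ++i) {
                MPI_Comm child;
                /* The np/2 active procs collectively spawn np/2 children;
                   the child is assumed to disconnect and exit. */
                MPI_Comm_spawn("./spawn_stress_child", MPI_ARGV_NULL, half_np,
                               MPI_INFO_NULL, 0, half, &child,
                               MPI_ERRCODES_IGNORE);
                MPI_Comm_disconnect(&child);
            }
            if (0 == rank) {
                /* Release the idle ranks. */
                for (i = half_np; i < size; ++i) {
                    MPI_Send(NULL, 0, MPI_BYTE, i, 0, MPI_COMM_WORLD);
                }
            }
        } else {
            /* Idle half: sleep and poke MPI progress until released. */
            int done = 0;
            MPI_Request req;

            MPI_Irecv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            while (!done) {
                sleep(1);
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
            }
        }

        MPI_Comm_free(&half);
        MPI_Finalize();
        return 0;
    }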

On Aug 15, 2011, at 11:47 AM, Eugene Loh wrote:

> This is a question about ompi-tests/ibm/dynamic. Some of these tests (spawn,
> spawn_multiple, loop_spawn/child, and no-disconnect) exercise MPI_Comm_spawn*
> functionality. Specifically, they spawn additional processes (beyond the
> initial mpirun launch) and therefore exert a different load on a test system
> than one might naively expect from the "mpirun -np <np>" command line.
>
> One approach to testing is to have the test harness know characteristics
> about individual tests like this. E.g., if I have only 8 processors and I
> don't want to oversubscribe, have the test harness know that particular tests
> should be launched with fewer processes. On the other hand, building such
> generality into a test harness when changes would have to be so pervasive
> (subjective assessment) and so few tests require it may not make that much
> sense.
>
> Another approach would be to manage oversubscription in the tests themselves.
> E.g., for spawn.c, instead of spawning np new processes, do the following:
>
> - idle np/2 of the processes
> - have the remaining np/2 processes spawn np/2 new ones
>
> (Okay, so that leaves open the possibility that the newly spawned processes
> might not appear on the same nodes where idled processes have "made room" for
> them. Each solution seems loaded with shortcomings.)
>
> Anyhow, I was interested in some feedback on this topic. A very small number
> (1-4) of spawning tests are causing us lots of problems (undue complexity in
> the test harness as well as a bunch of our time for reasons I find difficult
> to explain succinctly). We're inclined to modify the tests so that they're a
> little more social. E.g., make decisions about how many of the launched
> processes should "really" be used, idling some fraction of the processes, and
> continuing the test only with the remaining fraction.
>
> Comments?
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/