[OMPI devel] udcm_component_query hangs when memlock not infinite
Hello, We're setting up a new cluster here. Open MPI 1.7.4 was hanging at startup without any error message. The issue appears to be udcm_component_query() hanging in finalize(), in the sched_yield() loop, when the memlock limit isn't set to unlimited as it usually is. Unfortunately, the hang occurs before we print the usual error message saying that the memlock limit needs to be set to unlimited. If the udcm problem cannot be fixed, it would be good to print an error message about memlock not being unlimited much earlier. Brice
[OMPI devel] compile error in trunk
Hi, This commit caused the failure:
1. Comments about 'db' arguments.
2. Fixes #4205: ensure sizeof(MPI_Count) <= sizeof(size_t)

13:28:24  CC ompi_datatype_args.lo
13:28:24  In file included from ../../ompi/datatype/ompi_datatype.h:43,
13:28:24                   from ompi_datatype_args.c:33:
13:28:24  ../../ompi/include/mpi.h:324: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:325: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:326: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Count'
13:28:24  ../../ompi/include/mpi.h:374: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:376: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1159: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1164: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1178: error: expected ')' before 'size'
13:28:24  [... dozens of further "expected declaration specifiers or '...'" errors and "type defaults to 'int'" warnings for MPI_Aint, MPI_Offset, and MPI_Count at mpi.h lines 1331-1681 ...]
Re: [OMPI devel] compile error in trunk
Was just fixed in https://svn.open-mpi.org/trac/ompi/changeset/30780. On Feb 20, 2014, at 7:11 AM, Mike Dubman wrote: > Hi, > This commit caused the failure: > • Comments about 'db' arguments. > • Fixes #4205: ensure sizeof(MPI_Count) <= sizeof(size_t) > [quoted compiler output trimmed]
[OMPI devel] oshmem tests
I took the liberty of committing the openshmem test suite 1.0d to the ompi-tests SVN, mainly because there are some post-release patches that are necessary to get it to compile/run properly. Mellanox put some clever workarounds in the MTT ini file for a first round of patches, but I'm finding more patches are necessary (e.g., to declare time() and srand()). So let's just commit it, ensure all the patches are sent upstream, and see if 1.0e will contain all the fixes. Mellanox: did you send all your existing patches upstream already? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI devel] oshmem test suite errors
For all of these, I'm using the openshmem test suite that is now committed to the ompi-svn SVN repo. I don't know if the errors are with the tests or with oshmem itself.

1. I'm running the oshmem test suite at 32 processes across 2 16-core servers. I'm seeing a segv in "examples/shmem_2dheat.x 10 10". It seems to run fine at lower np values such as 2, 4, and 8; I didn't try to determine where the crossover to badness occurs.

2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and usnic BTLs, even when running at np=2 (I let it run for several minutes before killing it).

3. Ditto for "example/ptp.x 10 10".

4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- perhaps it would have finished eventually?).

...there are more results (more timeouts and more failures), but they're not yet complete, and I've got to keep working on my own features for v1.7.5, so I need to move on to other things right now. I think I have oshmem running well enough to add these to Cisco's nightly MTT runs now, so the results will start showing up there without needing my manual attention. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] udcm_component_query hangs when memlock not infinite
Thanks for the report; I filed https://svn.open-mpi.org/trac/ompi/ticket/4290. On Feb 20, 2014, at 4:34 AM, Brice Goglin wrote: > Hello, > > We're setting up a new cluster here. Open MPI 1.7.4 was hanging at startup without any error message. [rest of original report trimmed] > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] oshmem test suite errors
Could you send along the relevant mtt .ini sections? On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres) wrote: > For all of these, I'm using the openshmem test suite that is now committed to the ompi-svn SVN repo. I don't know if the errors are with the tests or with oshmem itself. [rest of quoted report trimmed]
Re: [OMPI devel] oshmem test suite errors
Yes, I've added them to my Cisco MTT ini files in the ompi-svn repo. Look in cisco/mtt/usnic/usnic-trunk.ini and usnic-v1.7.ini. All relevant sections have "oshmem" in them. Most are copied from the Mellanox examples, but I made a few tweaks/improvements here and there. I also anticipate adjusting some of the timeouts as we get a few MTT oshmem runs done in some of the sections for some longer-running tests (at np=32 and possibly 64). On Feb 20, 2014, at 10:34 AM, Ralph Castain wrote: > Could you send along the relevant mtt .ini sections? [rest of quoted thread trimmed] -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] oshmem test suite errors
On Feb 20, 2014, at 10:44 AM, "Jeff Squyres (jsquyres)" wrote: > Yes, I've added them to my Cisco MTT ini files in the ompi-svn repo. Err... I meant ompi-tests SVN repo. :-) > Look in cisco/mtt/usnic/usnic-trunk.ini and usnic-v1.7.ini. > > All relevant sections have "oshmem" in them. -- Jeff Squyres jsquy...@cisco.com
[OMPI devel] MPI_Comm_spawn under Torque
Hello! I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes across several nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't! Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like below.

node1
node1
node1
node2
node2
node2
node3
node3
node3
node4
node4
node4
node5
node5
node5

I started a hello program (which just spawns itself; of course, the children don't spawn again) with

mpiexec -np 3 ./hello

Spawning 3 more processes (on node 2) - works!
Spawning 6 more processes (nodes 2 and 3) - works!
Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!

I ideally want to spawn about 32 processes across a large number of nodes, but at the moment this is impossible. I have attached my hello program to this email. I will be happy to provide any more info or verbose outputs if you could please tell me what exactly you would like to see. Best, Suraj hello.c Description: Binary data
Re: [OMPI devel] oshmem test suite errors
On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres) wrote: > For all of these, I'm using the openshmem test suite that is now committed to the ompi-svn SVN repo. I don't know if the errors are with the tests or with oshmem itself. > > 1. I'm running the oshmem test suite at 32 processes across 2 16-core servers. I'm seeing a segv in "examples/shmem_2dheat.x 10 10". It seems to run fine at lower np values such as 2, 4, and 8; I didn't try to determine where the crossover to badness occurs.

My memory is bad and my notes are on a machine I no longer have access to, but I did this to the test suite run for Portals SHMEM:

Index: shmem_2dheat.c
===
--- shmem_2dheat.c	(revision 270)
+++ shmem_2dheat.c	(revision 271)
@@ -129,6 +129,11 @@
   p = _num_pes ();
   my_rank = _my_pe ();

+  if (p > 8) {
+    fprintf(stderr, "Ignoring test when run with more than 8 pes\n");
+    return 77;
+  }
+
   /* argument processing done by everyone */
   int c, errflg;
   extern char *optarg;

The commit comment was that there was a scaling issue in the code itself; I just wish I could remember exactly what it was.

> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and usnic BTLs, even when running at np=2 (I let it run for several minutes before killing it).

If atomics aren't fast, this test can run for a very long time (also, it takes no arguments, so the "10 10" is being ignored). It's essentially looking for a race by blasting 32-bit atomic ops at both halves of a 64-bit word.

> 3. Ditto for "example/ptp.x 10 10". > > 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- perhaps it would have finished eventually?). > > ...there's more results (more timeouts and more failures), but they're not yet complete, and I've got to keep working on my own features for v1.7.5, so I need to move to other things right now.
These start to sound like issues in the code; those last two are pretty decent tests. > I think I have oshmem running well enough to add these to Cisco's nightly MTT > runs now, so the results will start showing up there without needing my > manual attention. Woot. Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
Re: [OMPI devel] MPI_Comm_spawn under Torque
What OMPI version are you using? On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran wrote: > Hello! > > I am having problem using MPI_Comm_spawn under torque. It doesn't work when spawning more than 12 processes on various nodes. [rest of quoted message trimmed]
Re: [OMPI devel] MPI_Comm_spawn under Torque
I am using 1.7.4! On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote: > What OMPI version are you using? [rest of quoted thread trimmed]
[OMPI devel] Please remove me from this distribution
Please take me off distribution. -Original Message- From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Suraj Prabhakaran Sent: Thursday, February 20, 2014 1:14 PM To: Open MPI Developers Subject: Re: [OMPI devel] MPI_Comm_spawn under Torque [quoted thread trimmed]
Re: [OMPI devel] Please remove me from this distribution
On 15:45 Thu 20 Feb , julia.dudascik.contrac...@unnpp.gov wrote: > Please take me off distribution. http://www.open-mpi.org/mailman/listinfo.cgi/devel HTH -Andreas -- Andreas Schäfer HPC and Grid Computing Chair of Computer Science 3 Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany +49 9131 85-27910 PGP/GPG key via keyserver http://www.libgeodecomp.org
Re: [OMPI devel] MPI_Comm_spawn under Torque
Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see? You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it. simple_spawn.c Description: Binary data On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran wrote: > I am using 1.7.4! [rest of quoted thread trimmed]
Re: [OMPI devel] MPI_Comm_spawn under Torque
Thanks Ralph! I should have mentioned, though: without the Torque environment, spawning with ssh works fine, but under the Torque environment it does not. I started simple_spawn with 3 processes and spawned 9 more (3 per node on 3 nodes). There is no problem with the Torque environment itself, because all 9 processes are started on the respective nodes. But the MPI_Comm_spawn in the parent and the MPI_Init in the children "sometimes" don't return! This is the output of simple_spawn, which confirms the above statement.

[pid 31208] starting up!
[pid 31209] starting up!
[pid 31210] starting up!
0 completed MPI_Init
Parent [pid 31208] about to spawn!
1 completed MPI_Init
Parent [pid 31209] about to spawn!
2 completed MPI_Init
Parent [pid 31210] about to spawn!
[pid 28630] starting up!
[pid 28631] starting up!
[pid 9846] starting up!
[pid 9847] starting up!
[pid 9848] starting up!
[pid 6363] starting up!
[pid 6361] starting up!
[pid 6362] starting up!
[pid 28632] starting up!

Any hints? Best, Suraj On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote: > Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see? > > You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it. [rest of quoted thread trimmed]
Re: [OMPI devel] MPI_Comm_spawn under Torque
On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran wrote: > Thanks Ralph! > > I must have mentioned though. Without the Torque environment, spawning with ssh works ok. But Under the torque environment, not.

Ah, no - you forgot to mention that point.

> I started the simple_spawn with 3 processes and spawned 9 processes (3 per node on 3 nodes). > > There is no problem with the Torque environment because all the 9 processes are started on the respective nodes. But the MPI_Comm_spawn of the parent and MPI_Init of the children, "sometimes" don't return!

Seems odd - the launch environment has nothing to do with MPI_Init, so if the processes are indeed being started, they should run. One possibility is that they aren't correctly getting some wireup info. Can you configure OMPI --enable-debug and then rerun the example with "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the command line?

> [quoted simple_spawn output and earlier thread trimmed]