[OMPI devel] udcm_component_query hangs when memlock not infinite

2014-02-20 Thread Brice Goglin
Hello,

We're setting up a new cluster here. Open MPI 1.7.4 was hanging at
startup without any error message. The issue appears to be
udcm_component_query() hanging in finalize() in the sched_yield() loop
when the memlock limit isn't set to unlimited as usual.

Unfortunately, the hang occurs before we print the usual error message
"you need to set memlock limit to unlimited". If the udcm problem cannot
be fixed, it would be good to print an error message about memlock not
being unlimited much earlier.
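A minimal sketch of what such an early check could look like (illustrative
only, not actual Open MPI code):

#include <stdio.h>
#include <sys/resource.h>

/* Warn early if RLIMIT_MEMLOCK is finite; udcm/openib need it to be
 * unlimited (or at least very large) to register memory reliably. */
static void warn_if_memlock_limited(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY) {
        fprintf(stderr,
                "WARNING: memlock limit is %llu bytes, not unlimited; "
                "the openib BTL may hang or fail\n",
                (unsigned long long) rl.rlim_cur);
    }
}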

Brice



[OMPI devel] compile error in trunk

2014-02-20 Thread Mike Dubman
Hi,

This commit caused the failure:


   1. Comments about 'db' arguments.
   2. Fixes #4205: ensure sizeof(MPI_Count) <= sizeof(size_t)


13:28:24    CC   ompi_datatype_args.lo
13:28:24  In file included from ../../ompi/datatype/ompi_datatype.h:43,
13:28:24                   from ompi_datatype_args.c:33:
13:28:24  ../../ompi/include/mpi.h:324: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:325: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:326: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'MPI_Count'
13:28:24  ../../ompi/include/mpi.h:374: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:376: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  In file included from ../../ompi/datatype/ompi_datatype.h:43,
13:28:24                   from ompi_datatype_args.c:33:
13:28:24  ../../ompi/include/mpi.h:1159: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1164: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1178: error: expected ')' before 'size'
13:28:24  ../../ompi/include/mpi.h:1331: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1332: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1333: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1338: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1340: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1343: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1345: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1347: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1349: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1351: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1353: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1367: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1368: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1369: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1370: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1383: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1384: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1385: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1388: error: expected declaration specifiers or '...' before 'MPI_Offset'
13:28:24  ../../ompi/include/mpi.h:1404: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1424: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1427: error: expected declaration specifiers or '...' before 'MPI_Count'
13:28:24  ../../ompi/include/mpi.h:1430: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1550: warning: type defaults to 'int' in declaration of 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1550: error: expected ';', ',' or ')' before 'sdispls'
13:28:24  ../../ompi/include/mpi.h:1553: warning: type defaults to 'int' in declaration of 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1553: error: expected ';', ',' or ')' before 'sdispls'
13:28:24  ../../ompi/include/mpi.h:1564: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1564: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1566: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1576: error: expected declaration specifiers or '...' before 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1653: error: expected declaration specifiers or '...' before 'MPI_Count'
13:28:24  ../../ompi/include/mpi.h:1677: warning: type defaults to 'int' in declaration of 'MPI_Aint'
13:28:24  ../../ompi/include/mpi.h:1677: error: expected ';', ',' or ')' before 'array_of_displacements'
13:28:24  ../../ompi/include/mpi.h:1681: warning: type defaults to 'int' in declaration of 'MPI_Aint'

Re: [OMPI devel] compile error in trunk

2014-02-20 Thread Jeff Squyres (jsquyres)
Was just fixed in https://svn.open-mpi.org/trac/ompi/changeset/30780.
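For reference, the #4205 guard boils down to a compile-time check that
MPI_Count fits in size_t. A minimal sketch of such a check (illustrative
only; the changeset may implement it differently, e.g. at configure time):

#include <stddef.h>
#include <mpi.h>

/* C89-compatible static assertion: the array size becomes -1 (a
 * compile error) if MPI_Count is wider than size_t. */
typedef char assert_mpi_count_fits_in_size_t
    [(sizeof(MPI_Count) <= sizeof(size_t)) ? 1 : -1];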


On Feb 20, 2014, at 7:11 AM, Mike Dubman  wrote:

> Hi,
> This commit caused the failure:
>   • Comments about 'db' arguments. 
>   • Fixes #4205: ensure sizeof(MPI_Count) <= sizeof(size_t) 
> 
> 13:28:24 
>   CC   ompi_datatype_args.lo
> 
> 13:28:24 
> In file included from ../../ompi/datatype/ompi_datatype.h:43,
> 
> 13:28:24 
>  from ompi_datatype_args.c:33:
> 
> 13:28:24 
> ../../ompi/include/mpi.h:324: error: expected '=', ',', ';', 'asm' or 
> '__attribute__' before 'MPI_Aint'
> 
> [... rest of the error log snipped; it matches the output quoted above ...]

[OMPI devel] oshmem tests

2014-02-20 Thread Jeff Squyres (jsquyres)
I took the liberty of committing the openshmem test suite 1.0d to the 
ompi-tests SVN, mainly because there are some post-release patches that are 
necessary to get it to compile/run properly.

Mellanox put some clever workarounds in the MTT ini file for a first round of 
patches, but I'm finding more patches are necessary (e.g., to declare time() 
and srand()).  So let's just commit it, ensure all the patches are sent 
upstream, and see if 1.0e will contain all the fixes.
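For illustration, the time()/srand() patches amount to something of this
shape (a sketch; the actual patches sent upstream may differ):

/* Pull in the standard headers so time() and srand() are declared;
 * stricter compilers reject the implicit declarations otherwise. */
#include <time.h>    /* time() */
#include <stdlib.h>  /* srand() */

static void seed_test_rng(void)
{
    srand((unsigned) time(NULL));
}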

Mellanox: did you send all your existing patches upstream already?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] oshmem test suite errors

2014-02-20 Thread Jeff Squyres (jsquyres)
For all of these, I'm using the openshmem test suite that is now committed to 
the ompi-svn SVN repo.  I don't know if the errors are with the tests or with 
oshmem itself.

1. I'm running the oshmem test suite at 32 processes across 2 16-core servers.  
I'm seeing a segv in "examples/shmem_2dheat.x 10 10".  It seems to run fine at 
lower np values such as 2, 4, and 8; I didn't try to determine where the 
crossover to badness occurs.

2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and usnic 
BTLs, even when running at np=2 (I let it run for several minutes before 
killing it).

3. Ditto for "example/ptp.x 10 10".

4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but 
hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- perhaps 
it would have finished eventually?).

...there are more results (more timeouts and more failures), but they're not yet 
complete, and I've got to keep working on my own features for v1.7.5, so I need 
to move on to other things right now.

I think I have oshmem running well enough to add these to Cisco's nightly MTT 
runs now, so the results will start showing up there without needing my manual 
attention.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] udcm_component_query hangs when memlock not infinite

2014-02-20 Thread Jeff Squyres (jsquyres)
Thanks for the report; I filed https://svn.open-mpi.org/trac/ompi/ticket/4290.

On Feb 20, 2014, at 4:34 AM, Brice Goglin  wrote:

> Hello,
> 
> We're setting up a new cluster here. Open MPI 1.7.4 was hanging at
> startup without any error message. The issue appears to be
> udcm_component_query() hanging in finalize() on the sched_yield() loop
> when memlock limit isn't set to unlimited as usual.
> 
> Unfortunately the hangs occur before we print the usual error message
> "you need to set memlock limit to unlimited". If the udcm problem cannot
> be fixed, it would be good to print an error message about memlock not
> being unlimited much earlier.
> 
> Brice
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] oshmem test suite errors

2014-02-20 Thread Ralph Castain
Could you send along the relevant mtt .ini sections?


On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres)  wrote:

> For all of these, I'm using the openshmem test suite that is now committed to 
> the ompi-svn SVN repo.  I don't know if the errors are with the tests or with 
> oshmem itself.
> 
> 1. I'm running the oshmem test suite at 32 processes across 2 16-core 
> servers.  I'm seeing a segv in "examples/shmem_2dheat.x 10 10".  It seems to 
> run fine at lower np values such as 2, 4, and 8; I didn't try to determine 
> where the crossover to badness occurs.
> 
> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and 
> usnic BTLs, even when running at np=2 (I let it run for several minutes 
> before killing it).
> 
> 3. Ditto for "example/ptp.x 10 10".
> 
> 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but 
> hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- 
> perhaps it would have finished eventually?).
> 
> ...there's more results (more timeouts and more failures), but they're not 
> yet complete, and I've got to keep working on my own features for v1.7.5, so 
> I need to move to other things right now.
> 
> I think I have oshmem running well enough to add these to Cisco's nightly MTT 
> runs now, so the results will start showing up there without needing my 
> manual attention.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] oshmem test suite errors

2014-02-20 Thread Jeff Squyres (jsquyres)
Yes, I've added them to my Cisco MTT ini files in the ompi-svn repo.  Look in 
cisco/mtt/usnic/usnic-trunk.ini and usnic-v1.7.ini.

All relevant sections have "oshmem" in them.

Most are copied from the Mellanox examples, but I made a few 
tweaks/improvements here and there.  I also anticipate adjusting some of the 
timeouts as we get a few MTT oshmem runs done in some of the sections for some 
longer-running tests (at np=32 and possibly 64).


On Feb 20, 2014, at 10:34 AM, Ralph Castain  wrote:

> Could you send along the relevant mtt .ini sections?
> 
> 
> On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> For all of these, I'm using the openshmem test suite that is now committed 
>> to the ompi-svn SVN repo.  I don't know if the errors are with the tests or 
>> with oshmem itself.
>> 
>> 1. I'm running the oshmem test suite at 32 processes across 2 16-core 
>> servers.  I'm seeing a segv in "examples/shmem_2dheat.x 10 10".  It seems to 
>> run fine at lower np values such as 2, 4, and 8; I didn't try to determine 
>> where the crossover to badness occurs.
>> 
>> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and 
>> usnic BTLs, even when running at np=2 (I let it run for several minutes 
>> before killing it).
>> 
>> 3. Ditto for "example/ptp.x 10 10".
>> 
>> 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but 
>> hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- 
>> perhaps it would have finished eventually?).
>> 
>> ...there's more results (more timeouts and more failures), but they're not 
>> yet complete, and I've got to keep working on my own features for v1.7.5, so 
>> I need to move to other things right now.
>> 
>> I think I have oshmem running well enough to add these to Cisco's nightly 
>> MTT runs now, so the results will start showing up there without needing my 
>> manual attention.
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] oshmem test suite errors

2014-02-20 Thread Jeff Squyres (jsquyres)
On Feb 20, 2014, at 10:44 AM, "Jeff Squyres (jsquyres)"  
wrote:

> Yes, I've added them to my Cisco MTT ini files in the ompi-svn repo.  

Err... I meant ompi-tests SVN repo.  :-)

> Look in cisco/mtt/usnic/usnic-trunk.ini and usnic-v1.7.ini.
> 
> All relevant sections have "oshmem" in them.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
Hello!

I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when 
spawning more than 12 processes across various nodes. To be more precise, 
"sometimes" it works, and "sometimes" it doesn't!

Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE looks 
like below.

node1
node1
node1
node2
node2
node2
node3
node3
node3
node4
node4
node4
node5
node5
node5

I started a hello program (which just spawns itself; the children, of course, 
don't spawn further) with 

mpiexec -np 3 ./hello

Spawning 3 more processes (on node 2) - works!
Spawning 6 more processes (nodes 2 and 3) - works!
Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!

I ideally want to spawn about 32 processes across a large number of nodes, but 
this is impossible at the moment. I have attached my hello program to this email. 
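For reference, the attached hello is of roughly this shape (a minimal
sketch, not the exact attachment):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* parent: spawn N copies of this binary; children see a
         * non-null parent communicator and skip the spawn */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 9, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("parent %d: spawn returned\n", rank);
    } else {
        printf("child %d: MPI_Init returned\n", rank);
    }

    MPI_Finalize();
    return 0;
}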

I will be happy to provide any more info or verbose outputs if you could please 
tell me what exactly you would like to see.

Best,
Suraj



hello.c
Description: Binary data


Re: [OMPI devel] oshmem test suite errors

2014-02-20 Thread Brian Barrett
On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres)  wrote:

> For all of these, I'm using the openshmem test suite that is now committed to 
> the ompi-svn SVN repo.  I don't know if the errors are with the tests or with 
> oshmem itself.
> 
> 1. I'm running the oshmem test suite at 32 processes across 2 16-core 
> servers.  I'm seeing a segv in "examples/shmem_2dheat.x 10 10".  It seems to 
> run fine at lower np values such as 2, 4, and 8; I didn't try to determine 
> where the crossover to badness occurs.

My memory is bad and my notes are on a machine I no longer have access to, but 
I did this to the test suite run for Portals SHMEM:

Index: shmem_2dheat.c
===================================================================
--- shmem_2dheat.c  (revision 270)
+++ shmem_2dheat.c  (revision 271)
@@ -129,6 +129,11 @@
   p = _num_pes ();
   my_rank = _my_pe ();
 
+  if (p > 8) {
+  fprintf(stderr, "Ignoring test when run with more than 8 pes\n");
+  return 77;
+  }
+
   /* argument processing done by everyone */
   int c, errflg;
   extern char *optarg;

The commit comment was that there was a scaling issue in the code itself; I 
just wish I could remember exactly what it was.

> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and 
> usnic BTLs, even when running at np=2 (I let it run for several minutes 
> before killing it).

If atomics aren't fast, this test can run for a very long time (also, it takes 
no arguments, so the "10 10" is being ignored).  It's essentially looking for a 
race by blasting 32-bit atomic ops at both halves of a 64-bit word.
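Roughly, the pattern is of this shape (an illustrative sketch, not the
actual test source):

#include <shmem.h>

static long long word = 0;   /* symmetric 64-bit target */

int main(void)
{
    start_pes(0);
    int me = _my_pe();                  /* intended for exactly 2 PEs */
    int other = 1 - me;
    int *half = ((int *) &word) + me;   /* PE 0 -> one half, PE 1 -> the other */
    int i;

    /* blast 32-bit atomic adds at one half of the remote PE's 64-bit
     * word while the other PE does the same to the adjacent half */
    for (i = 0; i < 100000; i++)
        shmem_int_add(half, 1, other);

    shmem_barrier_all();
    /* the targeted half on each PE should hold exactly 100000, and the
     * adjacent half should still be 0 if the atomics never bled over */
    return 0;
}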

> 3. Ditto for "example/ptp.x 10 10".
> 
> 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but 
> hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- 
> perhaps it would have finished eventually?).
> 
> ...there's more results (more timeouts and more failures), but they're not 
> yet complete, and I've got to keep working on my own features for v1.7.5, so 
> I need to move to other things right now.

These start to sound like issues in the code; those last two are pretty decent 
tests.

> I think I have oshmem running well enough to add these to Cisco's nightly MTT 
> runs now, so the results will start showing up there without needing my 
> manual attention.

Woot.

Brian

-- 
 Brian Barrett

 There is an art . . . to flying. The knack lies in learning how to
 throw yourself at the ground and miss.
 Douglas Adams, 'The Hitchhiker's Guide to the Galaxy'



Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain
What OMPI version are you using?

On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran  
wrote:

> Hello!
> 
> I am having problem using MPI_Comm_spawn under torque. It doesn't work when 
> spawning more than 12 processes on various nodes. To be more precise, 
> "sometimes" it works, and "sometimes" it doesn't!
> 
> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE 
> looks like below.
> 
> node1
> node1
> node1
> node2
> node2
> node2
> node3
> node3
> node3
> node4
> node4
> node4
> node5
> node5
> node5
> 
> I started a hello program (which just spawns itself and of course, the 
> children don't spawn), with 
> 
> mpiexec -np 3 ./hello
> 
> Spawning 3 more processes (on node 2) - works!
> spawning 6 more processes (node 2 and 3) - works!
> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
> spawning 12 processes (node 2,3,4,5) - "mostly" not!
> 
> I ideally want to spawn about 32 processes with large number of nodes, but 
> this is at the moment impossible. I have attached my hello program to this 
> email. 
> 
> I will be happy to provide any more info or verbose outputs if you could 
> please tell me what exactly you would like to see.
> 
> Best,
> Suraj
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
I am using 1.7.4! 

On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:

> What OMPI version are you using?
> 
> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran  
> wrote:
> 
>> Hello!
>> 
>> I am having problem using MPI_Comm_spawn under torque. It doesn't work when 
>> spawning more than 12 processes on various nodes. To be more precise, 
>> "sometimes" it works, and "sometimes" it doesn't!
>> 
>> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE 
>> looks like below.
>> 
>> node1
>> node1
>> node1
>> node2
>> node2
>> node2
>> node3
>> node3
>> node3
>> node4
>> node4
>> node4
>> node5
>> node5
>> node5
>> 
>> I started a hello program (which just spawns itself and of course, the 
>> children don't spawn), with 
>> 
>> mpiexec -np 3 ./hello
>> 
>> Spawning 3 more processes (on node 2) - works!
>> spawning 6 more processes (node 2 and 3) - works!
>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>> 
>> I ideally want to spawn about 32 processes with large number of nodes, but 
>> this is at the moment impossible. I have attached my hello program to this 
>> email. 
>> 
>> I will be happy to provide any more info or verbose outputs if you could 
>> please tell me what exactly you would like to see.
>> 
>> Best,
>> Suraj
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] Please remove me from this distribution

2014-02-20 Thread Julia.Dudascik.Contractor
Please take me off distribution. 

-Original Message-
From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Suraj
Prabhakaran
Sent: Thursday, February 20, 2014 1:14 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] MPI_Comm_spawn under Torque

I am using 1.7.4! 

On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:

> What OMPI version are you using?
> 
> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran
 wrote:
> 
>> Hello!
>> 
>> I am having problem using MPI_Comm_spawn under torque. It doesn't
work when spawning more than 12 processes on various nodes. To be more
precise, "sometimes" it works, and "sometimes" it doesn't!
>> 
>> Here is my case. I obtain 5 nodes, 3 cores per node and my
$PBS_NODEFILE looks like below.
>> 
>> node1
>> node1
>> node1
>> node2
>> node2
>> node2
>> node3
>> node3
>> node3
>> node4
>> node4
>> node4
>> node5
>> node5
>> node5
>> 
>> I started a hello program (which just spawns itself and of course,
the children don't spawn), with 
>> 
>> mpiexec -np 3 ./hello
>> 
>> Spawning 3 more processes (on node 2) - works!
>> spawning 6 more processes (node 2 and 3) - works!
>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>> 
>> I ideally want to spawn about 32 processes with large number of
nodes, but this is at the moment impossible. I have attached my hello
program to this email. 
>> 
>> I will be happy to provide any more info or verbose outputs if you
could please tell me what exactly you would like to see.
>> 
>> Best,
>> Suraj
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] Please remove me from this distribution

2014-02-20 Thread Andreas Schäfer
On 15:45 Thu 20 Feb , julia.dudascik.contrac...@unnpp.gov wrote:
> Please take me off distribution. 

http://www.open-mpi.org/mailman/listinfo.cgi/devel

HTH
-Andreas


-- 
==
Andreas Schäfer
HPC and Grid Computing
Chair of Computer Science 3
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-27910
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==

(\___/)
(+'.'+)
(")_(")
This is Bunny. Copy and paste Bunny into your
signature to help him gain world domination!


signature.asc
Description: Digital signature


Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain
Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't 
work"? Is there some specific behavior you see?

You might try the attached program. It's a simple spawn test we use - 1.7.4 
seems happy with it.



simple_spawn.c
Description: Binary data


On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran  
wrote:

> I am using 1.7.4! 
> 
> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
> 
>> What OMPI version are you using?
>> 
>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran  
>> wrote:
>> 
>>> Hello!
>>> 
>>> I am having problem using MPI_Comm_spawn under torque. It doesn't work when 
>>> spawning more than 12 processes on various nodes. To be more precise, 
>>> "sometimes" it works, and "sometimes" it doesn't!
>>> 
>>> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE 
>>> looks like below.
>>> 
>>> node1
>>> node1
>>> node1
>>> node2
>>> node2
>>> node2
>>> node3
>>> node3
>>> node3
>>> node4
>>> node4
>>> node4
>>> node5
>>> node5
>>> node5
>>> 
>>> I started a hello program (which just spawns itself and of course, the 
>>> children don't spawn), with 
>>> 
>>> mpiexec -np 3 ./hello
>>> 
>>> Spawning 3 more processes (on node 2) - works!
>>> spawning 6 more processes (node 2 and 3) - works!
>>> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
>>> spawning 12 processes (node 2,3,4,5) - "mostly" not!
>>> 
>>> I ideally want to spawn about 32 processes with large number of nodes, but 
>>> this is at the moment impossible. I have attached my hello program to this 
>>> email. 
>>> 
>>> I will be happy to provide any more info or verbose outputs if you could 
>>> please tell me what exactly you would like to see.
>>> 
>>> Best,
>>> Suraj
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
Thanks Ralph!

I should have mentioned, though: without the Torque environment, spawning with 
ssh works OK. But under the Torque environment, it doesn't. 

I started the simple_spawn with 3 processes and spawned 9 processes (3 per node 
on 3 nodes). 

There is no problem with the Torque environment as such, because all 9 processes 
are started on the respective nodes. But MPI_Comm_spawn in the parent and 
MPI_Init in the children "sometimes" don't return!

This is the output of simple_spawn - which confirms the above statement. 

[pid 31208] starting up!
[pid 31209] starting up!
[pid 31210] starting up!
0 completed MPI_Init
Parent [pid 31208] about to spawn!
1 completed MPI_Init
Parent [pid 31209] about to spawn!
2 completed MPI_Init
Parent [pid 31210] about to spawn!
[pid 28630] starting up!
[pid 28631] starting up!
[pid 9846] starting up!
[pid 9847] starting up!
[pid 9848] starting up!
[pid 6363] starting up!
[pid 6361] starting up!
[pid 6362] starting up!
[pid 28632] starting up!

Any hints?

Best,
Suraj

On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:

> Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't 
> work"? Is there some specific behavior you see?
> 
> You might try the attached program. It's a simple spawn test we use - 1.7.4 
> seems happy with it.
> 
> 
> 
> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran  
> wrote:
> 
>> I am using 1.7.4! 
>> 
>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>> 
>>> What OMPI version are you using?
>>> 
>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
>>>  wrote:
>>> 
 Hello!
 
 I am having problem using MPI_Comm_spawn under torque. It doesn't work 
 when spawning more than 12 processes on various nodes. To be more precise, 
 "sometimes" it works, and "sometimes" it doesn't!
 
 Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE 
 looks like below.
 
 node1
 node1
 node1
 node2
 node2
 node2
 node3
 node3
 node3
 node4
 node4
 node4
 node5
 node5
 node5
 
 I started a hello program (which just spawns itself and of course, the 
 children don't spawn), with 
 
 mpiexec -np 3 ./hello
 
 Spawning 3 more processes (on node 2) - works!
 spawning 6 more processes (node 2 and 3) - works!
 spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
 spawning 12 processes (node 2,3,4,5) - "mostly" not!
 
 I ideally want to spawn about 32 processes with large number of nodes, but 
 this is at the moment impossible. I have attached my hello program to this 
 email. 
 
 I will be happy to provide any more info or verbose outputs if you could 
 please tell me what exactly you would like to see.
 
 Best,
 Suraj
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain

On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran  
wrote:

> Thanks Ralph!
> 
> I should have mentioned, though: without the Torque environment, spawning 
> with ssh works OK. But under the Torque environment, it doesn't. 

Ah, no - you forgot to mention that point.

> 
> I started the simple_spawn with 3 processes and spawned 9 processes (3 per 
> node on 3 nodes). 
> 
> There is no problem with the Torque environment as such, because all 9 
> processes are started on the respective nodes. But MPI_Comm_spawn in the 
> parent and MPI_Init in the children "sometimes" don't return!

Seems odd - the launch environment has nothing to do with MPI_Init, so if the 
processes are indeed being started, they should run. One possibility is that 
they aren't correctly getting some wireup info.

Can you configure OMPI with --enable-debug and then rerun the example with "-mca 
plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the 
command line?
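For reference, the full invocation under your allocation would look something
like this (assuming the simple_spawn binary from the earlier message):

mpiexec -np 3 \
    -mca plm_base_verbose 5 \
    -mca ess_base_verbose 5 \
    -mca grpcomm_base_verbose 5 \
    ./simple_spawn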


> 
> This is the output of simple_spawn - which confirms the above statement. 
> 
> [pid 31208] starting up!
> [pid 31209] starting up!
> [pid 31210] starting up!
> 0 completed MPI_Init
> Parent [pid 31208] about to spawn!
> 1 completed MPI_Init
> Parent [pid 31209] about to spawn!
> 2 completed MPI_Init
> Parent [pid 31210] about to spawn!
> [pid 28630] starting up!
> [pid 28631] starting up!
> [pid 9846] starting up!
> [pid 9847] starting up!
> [pid 9848] starting up!
> [pid 6363] starting up!
> [pid 6361] starting up!
> [pid 6362] starting up!
> [pid 28632] starting up!
> 
> Any hints?
> 
> Best,
> Suraj
> 
> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
> 
>> Hmmm...I don't see anything immediately glaring. What do you mean by 
>> "doesn't work"? Is there some specific behavior you see?
>> 
>> You might try the attached program. It's a simple spawn test we use - 1.7.4 
>> seems happy with it.
>> 
>> 
>> 
>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran 
>>  wrote:
>> 
>>> I am using 1.7.4! 
>>> 
>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>> 
 What OMPI version are you using?
 
 On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran 
  wrote:
 
> Hello!
> 
> I am having problem using MPI_Comm_spawn under torque. It doesn't work 
> when spawning more than 12 processes on various nodes. To be more 
> precise, "sometimes" it works, and "sometimes" it doesn't!
> 
> Here is my case. I obtain 5 nodes, 3 cores per node and my $PBS_NODEFILE 
> looks like below.
> 
> node1
> node1
> node1
> node2
> node2
> node2
> node3
> node3
> node3
> node4
> node4
> node4
> node5
> node5
> node5
> 
> I started a hello program (which just spawns itself and of course, the 
> children don't spawn), with 
> 
> mpiexec -np 3 ./hello
> 
> Spawning 3 more processes (on node 2) - works!
> spawning 6 more processes (node 2 and 3) - works!
> spawning 9 processes (node 2,3,4) - "sometimes" OK, "sometimes" not!
> spawning 12 processes (node 2,3,4,5) - "mostly" not!
> 
> I ideally want to spawn about 32 processes with large number of nodes, 
> but this is at the moment impossible. I have attached my hello program to 
> this email. 
> 
> I will be happy to provide any more info or verbose outputs if you could 
> please tell me what exactly you would like to see.
> 
> Best,
> Suraj
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel