Re: [OMPI devel] trunk problem for large-SMP startup?

2009-03-05 Thread Eugene Loh

Ralph Castain wrote:

I just ran a 64ppn job without problem. Couple of possibilities come
to mind:

1. you might have some stale lib around - try blowing things away and
rebuilding

2. there may be a problem in your specific situation. Can you provide
some info on what you are doing (e.g., what environment)?


I think it was indeed something in the trunk.  Rolf vandeVaart had the
same problem.  But, I think it's resolved:


(long ago) works
...
20655 broken
20669 broken
20687 works
20728 works
20738 works

So, something broke a while back and, going by the list above, got fixed
between 20669 and 20687.
Okay, I'm back in business and will charge off into the next concrete wall.
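
For anyone who wants to narrow this down mechanically, here is a minimal sketch that takes the revision/status pairs from the list above and reports the pair of adjacent tested revisions where the status flips from broken to works. It assumes the list is accurate as posted; the function name `fix_window` is just something made up for illustration.

```python
# Revision/status pairs copied from the list in this message.
results = [
    (20655, "broken"),
    (20669, "broken"),
    (20687, "works"),
    (20728, "works"),
    (20738, "works"),
]


def fix_window(results):
    """Return (last_broken, first_working) for the first broken->works flip.

    Returns None if no such flip exists among the tested revisions.
    """
    ordered = sorted(results)  # sort by revision number
    for (rev_a, status_a), (rev_b, status_b) in zip(ordered, ordered[1:]):
        if status_a == "broken" and status_b == "works":
            return rev_a, rev_b
    return None


print(fix_window(results))  # (20669, 20687)
```

The window only says where the fix landed relative to the revisions that were actually tested; revisions inside the window would still need to be built and run to pin down the exact commit.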


Re: [OMPI devel] trunk problem for large-SMP startup?

2009-03-04 Thread Ralph Castain
I'll take a look - offhand, I don't know of anything limiting you to  
<= 64 ppn



On Mar 4, 2009, at 1:49 PM, Eugene Loh wrote:

I have a problem starting large SMP jobs (e.g., 64 processes on a  
single SMP) that might be related to a recent trunk change.   
(Guessing.)  Does the following ring any bells?


...
...
...
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_grpcomm_modex failed
--> Returned "Not found" (-13) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6756] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6757] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

...
...
...




[OMPI devel] trunk problem for large-SMP startup?

2009-03-04 Thread Eugene Loh
I have a problem starting large SMP jobs (e.g., 64 processes on a single 
SMP) that might be related to a recent trunk change.  (Guessing.)  Does 
the following ring any bells?


...
...
...
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378

--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 orte_grpcomm_modex failed
 --> Returned "Not found" (-13) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6756] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6757] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!

...
...
...
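
With 64 processes the output above gets long, so a rough sketch for boiling it down may help: the script below groups the ORTE_ERROR_LOG lines by process and lists the file/line sites each one reported. The regular expression is inferred only from the lines shown in this message, not from any documented Open MPI log format, and treating the last number in `[[57827,1],42]` as the rank is an assumption. The helper name `summarize` is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the log lines in this message (an assumption,
# not a documented format): [host:pid] [[a,b],rank] ORTE_ERROR_LOG: msg
# in file <file> at line <n>
LOG_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"\[\[\d+,\d+\],(?P<rank>\d+)\] "
    r"ORTE_ERROR_LOG: (?P<msg>.+?) in file (?P<file>\S+) at line (?P<line>\d+)"
)


def summarize(log_text):
    """Map each rank to the list of (file, line) sites it reported."""
    # Mail clients wrap long lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    by_rank = defaultdict(list)
    for m in LOG_RE.finditer(flat):
        by_rank[int(m.group("rank"))].append(
            (m.group("file"), int(m.group("line")))
        )
    return dict(by_rank)


# A few lines copied from the output above, wrapped as a mailer might.
sample = """
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file
base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file
ess_env_module.c at line 299
"""

for rank, sites in sorted(summarize(sample).items()):
    print(rank, sites)
```

If every failing process reports the same three sites, as in the excerpt here, that points at one common failure path during the modex rather than several unrelated problems.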


trunk-problem.tar.gz
Description: CPIO file