I just ran a 64-ppn job without a problem. A couple of possibilities
come to mind:
1. You might have a stale lib lying around - try blowing things away
and rebuilding.
2. There may be a problem specific to your situation. Can you provide
some info on what you are doing (e.g., what environment)?
Ralph
On Mar 4, 2009, at 2:22 PM, Ralph Castain wrote:
I'll take a look - offhand, I don't know of anything limiting you to
<= 64 ppn
On Mar 4, 2009, at 1:49 PM, Eugene Loh wrote:
I have a problem starting large SMP jobs (e.g., 64 processes on a
single SMP) that might be related to a recent trunk change.
(Guessing.) Does the following ring any bells?
...
...
...
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
orte_grpcomm_modex failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6756] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6757] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
...
...
...
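
In case it helps narrow things down, output like the above can be grouped by rank to see which processes fail and where. What follows is only a rough sketch in Python, not anything that ships with Open MPI: the function name summarize_orte_errors and the regular expression are made up for illustration, and it assumes the last number in the bracketed process name (e.g., the 42 in [[57827,1],42]) is the rank. It collapses whitespace first so that entries wrapped by a mail client still match.

```python
import re
from collections import defaultdict

# Hypothetical pattern for lines such as:
#   [host:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
# Assumption: the last number in the bracketed process name is the rank.
ORTE_ERROR = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<jobfam>\d+),(?P<job>\d+)\],(?P<rank>\d+)\]\s+"
    r"ORTE_ERROR_LOG:\s+(?P<msg>.+?)\s+in\s+file\s+(?P<file>\S+)"
    r"\s+at\s+line\s+(?P<line>\d+)"
)


def summarize_orte_errors(log_text):
    """Return {rank: [(source_file, line_number, message), ...]}."""
    # Collapse all whitespace so entries split across lines are rejoined.
    flat = " ".join(log_text.split())
    by_rank = defaultdict(list)
    for m in ORTE_ERROR.finditer(flat):
        entry = (m.group("file"), int(m.group("line")), m.group("msg"))
        by_rank[int(m.group("rank"))].append(entry)
    return dict(by_rank)


if __name__ == "__main__":
    sample = (
        "[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in\n"
        "file ess_env_module.c at line 299\n"
        "[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in\n"
        "file ess_env_module.c at line 299\n"
    )
    for rank, entries in sorted(summarize_orte_errors(sample).items()):
        for source_file, line_number, message in entries:
            print(rank, source_file, line_number, message)
```

If every failing rank reports the same three file/line locations, as in the excerpt above, that would suggest a common cause rather than something specific to individual processes.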
<trunk-problem.tar.gz>