Re: [OMPI devel] trunk problem for large-SMP startup?
Ralph Castain wrote:

    I just ran a 64ppn job without problem. Couple of possibilities come to mind:

    1. you might have some stale lib around - try blowing things away and rebuilding
    2. there may be a problem in your specific situation. Can you provide some info on what you are doing (e.g., what environment)?

I think it was indeed something in the trunk. Rolf vandevaart had the same problem. But, I think it's resolved:

    (long ago)  works
    ...
    20655       broken
    20669       broken
    20687       works
    20728       works
    20738       works

So, something broke awhile back and got fixed between 20687 and 20728. Okay, I'm back in business and will charge off into the next concrete wall.
Re: [OMPI devel] trunk problem for large-SMP startup?
I'll take a look - offhand, I don't know of anything limiting you to <= 64 ppn

On Mar 4, 2009, at 1:49 PM, Eugene Loh wrote:

    I have a problem starting large SMP jobs (e.g., 64 processes on a single SMP) that might be related to a recent trunk change. (Guessing.) Does the following ring any bells? ...
[OMPI devel] trunk problem for large-SMP startup?
I have a problem starting large SMP jobs (e.g., 64 processes on a single SMP) that might be related to a recent trunk change. (Guessing.) Does the following ring any bells?

...
...
...
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06797] [[57827,1],41] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_grpcomm_modex failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6756] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[burl-t5440-0:6757] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
...
...
...

trunk-problem.tar.gz Description: CPIO file
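For anyone triaging output like the above, a small script can group the ORTE_ERROR_LOG lines by process and show where each one failed. This is only a sketch and not part of Open MPI: the function name `summarize` and the regular expression are assumptions based purely on the line format shown in this message, and it treats the last number in a name like `[[57827,1],42]` as the process rank.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the output quoted in this thread:
# [host:pid] [[a,b],rank] ORTE_ERROR_LOG: <message> in file <file> at line <n>
ERROR_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<job>\d+,\d+)\],(?P<rank>\d+)\]\s+"
    r"ORTE_ERROR_LOG:\s+(?P<msg>.+?)\s+in file\s+(?P<file>\S+)\s+at line\s+(?P<line>\d+)"
)


def summarize(log_text):
    """Return {rank: [(file, line, message), ...]} in order of appearance."""
    by_rank = defaultdict(list)
    for m in ERROR_RE.finditer(log_text):
        entry = (m.group("file"), int(m.group("line")), m.group("msg"))
        by_rank[int(m.group("rank"))].append(entry)
    return dict(by_rank)


# A few lines copied from the output above; the last one is not an
# ORTE_ERROR_LOG line and is ignored by the pattern.
SAMPLE = """\
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file base/grpcomm_base_modex.c at line 416
[burl-t5440-0:06798] [[57827,1],42] ORTE_ERROR_LOG: Not found in file grpcomm_bad_module.c at line 378
[burl-t5440-0:06800] [[57827,1],44] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 299
[burl-t5440-0:6756] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
"""

for rank, errors in sorted(summarize(SAMPLE).items()):
    first_file, first_line, message = errors[0]
    print(f"rank {rank}: {len(errors)} error(s), first: {message!r} at {first_file}:{first_line}")
```

Run against a full capture, this makes it easier to see whether every failing process hits the same first error (here, ess_env_module.c at line 299) or whether only some ranks are affected.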