Howard, I should note that this code ran fine up to the point that our sysadmins updated something on the cluster. That makes me think it is a configuration issue, and that it wouldn’t give you any insight if you ran my reproducer. It would succeed for you and still fail for me.
What do you think? I'll try to get some info from the sysadmins about what they changed.

Thanks,
Kurt

From: Pritchard Jr., Howard <[email protected]>
Sent: Monday, July 1, 2024 11:03 AM
To: Open MPI Users <[email protected]>
Cc: Mccall, Kurt E. (MSFC-EV41) <[email protected]>
Subject: Re: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Hello Kurt,

The host name looks a little odd. Do you by chance have a reproducer and instructions on how you're running it that we could try?

Howard

From: users <[email protected]> on behalf of "Mccall, Kurt E. (MSFC-EV41) via users" <[email protected]>
Reply-To: Open MPI Users <[email protected]>
Date: Monday, July 1, 2024 at 9:36 AM
To: "OpenMpi User List ([email protected])" <[email protected]>
Cc: "Mccall, Kurt E. (MSFC-EV41)" <[email protected]>
Subject: [EXTERNAL] [OMPI users] Slurm or OpenMPI error?

Using OpenMPI 5.0.3 and Slurm 20.11.8. Is this error message issued by Slurm or by OpenMPI? A Google search on the error message yielded nothing.

--------------------------------------------------------------------------
At least one of the requested hosts is not included in the current allocation.

Missing requested host: n001^X

Please check your allocation or your request.
--------------------------------------------------------------------------

Following that error, MPI_Comm_spawn failed on the named node, n001.

[n001:00000] *** An error occurred in MPI_Comm_spawn
[n001:00000] *** reported by process [595787777,0]
[n001:00000] *** on communicator MPI_COMM_SELF
[n001:00000] *** MPI_ERR_UNKNOWN: unknown error
[n001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n001:00000] ***   and MPI will try to terminate your MPI job as well)
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

Thanks,
Kurt
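[For readers following the thread: below is a minimal sketch, not Kurt's actual reproducer, of the kind of call that produces the error above. It assumes the parent requests placement of a child on a specific node via the standard "host" info key to MPI_Comm_spawn, and that a child executable named "./child" exists; both the child name and the node name "n001" are assumptions taken from the error output. If the host string carries trailing garbage (as the "n001^X" suggests), the runtime would not find the host in the Slurm allocation.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask the runtime to place the spawned child on node "n001"
     * using the MPI-standard "host" info key. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "n001");

    /* Spawn one child from the parent's own communicator, matching
     * the MPI_COMM_SELF shown in the error messages above. */
    MPI_Comm intercomm;
    int errcode;
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, &errcode);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}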
