On 07/05/2020 at 11:42, John Hearns via users wrote:
> Patrick, I assume you have asked Dell for support on this issue?

No, I didn't :-(. I only had access to those servers for a short time to
run a benchmark, and the workaround was enough. I'm not using Slurm but a
local scheduler (OAR), so the problem is not critical for my future work.


Patrick

>
> On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
>     I also have this problem with OpenMPI 4.0.3 on servers I'm
>     benchmarking at Dell's lab. I tried a new build of OpenMPI with
>     "--with-pmi2". No change.
>     In the end, my workaround in the Slurm script was to launch my code
>     with mpirun. As mpirun was only finding one slot per node, I used
>     "--oversubscribe --bind-to core" and checked that every process was
>     bound to a separate core. It worked, but don't ask me why :-)
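>
>     As a rough sketch (the executable name and process count below are
>     placeholders, not the actual benchmark), the launch line in the batch
>     script looked something like:
>
>         $ mpirun --oversubscribe --bind-to core -np 32 ./my_bench
>
>     Adding "--report-bindings" to that line is one way to confirm that
>     each rank really ends up bound to its own core.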
>
>     Patrick
>
>     On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
>     > Prentice, have you tried something trivial, like "srun -N3
>     hostname", to rule out non-OMPI problems?
>     >
>     > Andy
>     >
>     > -----Original Message-----
>     > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>     > Prentice Bisbal via users
>     > Sent: Friday, April 24, 2020 2:19 PM
>     > To: Ralph Castain <r...@open-mpi.org>; Open MPI Users
>     > <users@lists.open-mpi.org>
>     > Cc: Prentice Bisbal <pbis...@pppl.gov>
>     > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>     >
>     > Okay. I've got Slurm built with pmix support:
>     >
>     > $ srun --mpi=list
>     > srun: MPI types are...
>     > srun: none
>     > srun: pmix_v3
>     > srun: pmi2
>     > srun: openmpi
>     > srun: pmix
>     >
>     > But now when I try to launch a job with srun, the job appears to be
>     > running but doesn't actually do anything - it just hangs in the
>     > running state (a sketch of the launch line is below). Any ideas what
>     > could be wrong, or how to debug this?
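>     >
>     > For concreteness, the launch is of this general form (the program
>     > name, node count, and choice of --mpi plugin here are placeholders,
>     > not the exact command):
>     >
>     >     $ srun -N 3 --mpi=pmix_v3 ./hello_mpi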
>     >
>     > I'm also asking about this on the Slurm mailing list.
>     >
>     > Prentice
>     >
>     > On 4/23/20 3:03 PM, Ralph Castain wrote:
>     >> You can trust the --mpi=list output. The problem is likely that
>     >> OMPI wasn't configured with --with-pmi2.
>     >>
>     >>
>     >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
>     <users@lists.open-mpi.org> wrote:
>     >>>
>     >>> --mpi=list shows pmi2 and openmpi as valid values, but if I
>     set --mpi= to either of them, my job still fails. Why is that? Can
>     I not trust the output of --mpi=list?
>     >>>
>     >>> Prentice
>     >>>
>     >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>     >>>> No, but you do have to explicitly build OMPI with non-PMIx
>     support if that is what you are going to use. In this case, you
>     need to configure OMPI --with-pmi2=<path-to-the-pmi2-installation>
>     >>>>
>     >>>> You can leave off the path (i.e., just use "--with-pmi2") if
>     >>>> Slurm was installed in a standard location, as we should find
>     >>>> it there.
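>     >>>>
>     >>>> For example, a minimal sketch (the install prefix and the PMI2
>     >>>> path are placeholders, not known locations on this cluster):
>     >>>>
>     >>>>     $ ./configure --prefix=/opt/openmpi-4.0.3 --with-slurm \
>     >>>>           --with-pmi2=/path/to/slurm/pmi2
>     >>>>     $ make -j 8 && make install
>     >>>>
>     >>>> or just "--with-pmi2" with no path if Slurm's PMI2 headers and
>     >>>> library are installed in a standard location.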
>     >>>>
>     >>>>
>     >>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
>     <users@lists.open-mpi.org> wrote:
>     >>>>>
>     >>>>> It looks like it was built with PMI2, but not PMIx:
>     >>>>>
>     >>>>> $ srun --mpi=list
>     >>>>> srun: MPI types are...
>     >>>>> srun: none
>     >>>>> srun: pmi2
>     >>>>> srun: openmpi
>     >>>>>
>     >>>>> I did launch the job with srun --mpi=pmi2 ....
>     >>>>>
>     >>>>> Does OpenMPI 4 need PMIx specifically?
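>     >>>>>
>     >>>>> As an aside, one way to see which PMI-related components an
>     >>>>> OpenMPI build actually contains (the exact component names vary
>     >>>>> from build to build) is:
>     >>>>>
>     >>>>>     $ ompi_info | grep -i pmi
>     >>>>>
>     >>>>> If the build has PMI2/PMIx support, the corresponding MCA
>     >>>>> components should appear in that output.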
>     >>>>>
>     >>>>>
>     >>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>     >>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>     >>>>>>
>     >>>>>>
>     >>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
>     <users@lists.open-mpi.org> wrote:
>     >>>>>>>
>     >>>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the
>     >>>>>>> software with a very simple "hello, world" MPI program that I've
>     >>>>>>> used reliably for years. When I submit the job through Slurm and
>     >>>>>>> use srun to launch the job, I get these errors:
>     >>>>>>>
>     >>>>>>> *** An error occurred in MPI_Init
>     >>>>>>> *** on a NULL communicator
>     >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>     will now abort,
>     >>>>>>> ***    and potentially your MPI job)
>     >>>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT
>     completed completed successfully, but am not able to aggregate
>     error messages, and not able to guarantee that all other processes
>     were killed!
>     >>>>>>> *** An error occurred in MPI_Init
>     >>>>>>> *** on a NULL communicator
>     >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>     will now abort,
>     >>>>>>> ***    and potentially your MPI job)
>     >>>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT
>     completed completed successfully, but am not able to aggregate
>     error messages, and not able to guarantee that all other processes
>     were killed!
>     >>>>>>>
>     >>>>>>> If I run the same job but use mpiexec or mpirun instead of
>     >>>>>>> srun, the job runs just fine. I checked ompi_info to make sure
>     >>>>>>> OpenMPI was compiled with Slurm support:
>     >>>>>>>
>     >>>>>>> $ ompi_info | grep slurm
>     >>>>>>>    Configure command line:
>     '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
>     '--disable-silent-rules' '--enable-shared' '--with-pmix=internal'
>     '--with-slurm' '--with-psm'
>     >>>>>>>                   MCA ess: slurm (MCA v2.1.0, API v3.0.0,
>     Component v4.0.3)
>     >>>>>>>                   MCA plm: slurm (MCA v2.1.0, API v2.0.0,
>     Component v4.0.3)
>     >>>>>>>                   MCA ras: slurm (MCA v2.1.0, API v2.0.0,
>     Component v4.0.3)
>     >>>>>>>                MCA schizo: slurm (MCA v2.1.0, API v1.0.0,
>     Component v4.0.3)
>     >>>>>>>
>     >>>>>>> Any ideas what could be wrong? Do you need any additional
>     information?
>     >>>>>>>
>     >>>>>>> Prentice
>     >>>>>>>
>
