Patrick, I assume you have already asked Dell for support on this issue?

On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users <
users@lists.open-mpi.org> wrote:

> I also have this problem with OpenMPI-4.0.3 on servers I'm benchmarking
> at Dell's lab. I tried a new build of OpenMPI with "--with-pmi2"; no
> change.
> In the end, my workaround in the Slurm script was to launch my code with
> mpirun. Since mpirun was only finding one slot per node, I used
> "--oversubscribe --bind-to core" and checked that every process was
> bound to a separate core. It worked, but don't ask me why :-)
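>
> For reference, a minimal sketch of what such a batch script could look
> like (the SBATCH directives and the application name are illustrative,
> not the actual script):
>
>   #!/bin/bash
>   #SBATCH --nodes=2
>   #SBATCH --ntasks-per-node=32
>
>   mpirun --oversubscribe --bind-to core ./my_mpi_app
>
> Adding "--report-bindings" to the mpirun line is one way to confirm that
> each rank lands on its own core.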
>
> Patrick
>
> On 24/04/2020 at 20:28, Riebs, Andy via users wrote:
> > Prentice, have you tried something trivial, like "srun -N3 hostname", to
> rule out non-OMPI problems?
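> >
> > (Purely as an illustration, with made-up node names, a healthy
> > allocation would answer something like:
> >
> >   $ srun -N3 hostname
> >   node001
> >   node002
> >   node003
> >
> > If even that hangs or errors out, the problem is on the Slurm side
> > rather than in Open MPI.)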
> >
> > Andy
> >
> > -----Original Message-----
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> Prentice Bisbal via users
> > Sent: Friday, April 24, 2020 2:19 PM
> > To: Ralph Castain <r...@open-mpi.org>; Open MPI Users <
> users@lists.open-mpi.org>
> > Cc: Prentice Bisbal <pbis...@pppl.gov>
> > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
> >
> > Okay. I've got Slurm built with pmix support:
> >
> > $ srun --mpi=list
> > srun: MPI types are...
> > srun: none
> > srun: pmix_v3
> > srun: pmi2
> > srun: openmpi
> > srun: pmix
> >
> > But now when I try to launch a job with srun, the job appears to be
> > running but doesn't do anything - it just hangs in the running state.
> > Any ideas what could be wrong, or how to debug this?
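> >
> > Would it help, for example, to pin the PMIx plugin explicitly and keep
> > the job small so the hang is quick to reproduce, something like this
> > (the binary name is just a placeholder)?
> >
> >   $ srun --mpi=pmix_v3 -N 2 -n 4 ./hello_mpi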
> >
> > I'm also asking around on the Slurm mailing list.
> >
> > Prentice
> >
> > On 4/23/20 3:03 PM, Ralph Castain wrote:
> >> You can trust the --mpi=list output. The problem is likely that OMPI
> wasn't configured --with-pmi2
> >>
> >>
> >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>>
> >>> --mpi=list shows pmi2 and openmpi as valid values, but if I set --mpi=
> to either of them, my job still fails. Why is that? Can I not trust the
> output of --mpi=list?
> >>>
> >>> Prentice
> >>>
> >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
> >>>> No, but you do have to explicitly build OMPI with non-PMIx support if
> that is what you are going to use. In this case, you need to configure OMPI
> --with-pmi2=<path-to-the-pmi2-installation>
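> >>>>
> >>>> For example (the install prefix and the PMI2 path here are
> >>>> hypothetical; substitute your own):
> >>>>
> >>>>   ./configure --prefix=/opt/openmpi-4.0.3 --with-slurm --with-pmi2=/usr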
> >>>>
> >>>> You can leave off the path (i.e., just "--with-pmi2") if Slurm was
> installed in a standard location, as we should find it there.
> >>>>
> >>>>
> >>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>>>>
> >>>>> It looks like it was built with PMI2, but not PMIx:
> >>>>>
> >>>>> $ srun --mpi=list
> >>>>> srun: MPI types are...
> >>>>> srun: none
> >>>>> srun: pmi2
> >>>>> srun: openmpi
> >>>>>
> >>>>> I did launch the job with srun --mpi=pmi2 ....
> >>>>>
> >>>>> Does OpenMPI 4 need PMIx specifically?
> >>>>>
> >>>>>
> >>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
> >>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
> >>>>>>
> >>>>>>
> >>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users <
> users@lists.open-mpi.org> wrote:
> >>>>>>>
> >>>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the
> software with a very simple "hello, world" MPI program that I've used
> reliably for years. When I submit the job through Slurm and use srun to
> launch it, I get these errors:
> >>>>>>>
> >>>>>>> *** An error occurred in MPI_Init
> >>>>>>> *** on a NULL communicator
> >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >>>>>>> ***    and potentially your MPI job)
> >>>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT completed
> completed successfully, but am not able to aggregate error messages, and
> not able to guarantee that all other processes were killed!
> >>>>>>> *** An error occurred in MPI_Init
> >>>>>>> *** on a NULL communicator
> >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> >>>>>>> ***    and potentially your MPI job)
> >>>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT completed
> completed successfully, but am not able to aggregate error messages, and
> not able to guarantee that all other processes were killed!
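> >>>>>>>
> >>>>>>> For reference, the submission is of this general form (the script
> >>>>>>> contents and the binary name below are illustrative, not the exact
> >>>>>>> job script):
> >>>>>>>
> >>>>>>>   #!/bin/bash
> >>>>>>>   #SBATCH -N 2
> >>>>>>>   #SBATCH -n 4
> >>>>>>>
> >>>>>>>   srun ./hello_mpi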
> >>>>>>>
> >>>>>>> If I run the same job but use mpiexec or mpirun instead of srun,
> it runs just fine. I checked ompi_info to make sure OpenMPI was compiled
> with Slurm support:
> >>>>>>>
> >>>>>>> $ ompi_info | grep slurm
> >>>>>>>    Configure command line:
> '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3' '--disable-silent-rules'
> '--enable-shared' '--with-pmix=internal' '--with-slurm' '--with-psm'
> >>>>>>>                   MCA ess: slurm (MCA v2.1.0, API v3.0.0,
> Component v4.0.3)
> >>>>>>>                   MCA plm: slurm (MCA v2.1.0, API v2.0.0,
> Component v4.0.3)
> >>>>>>>                   MCA ras: slurm (MCA v2.1.0, API v2.0.0,
> Component v4.0.3)
> >>>>>>>                MCA schizo: slurm (MCA v2.1.0, API v1.0.0,
> Component v4.0.3)
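> >>>>>>>
> >>>>>>> A quick sanity check (just a guess at what might be relevant) is to
> >>>>>>> look at which PMI-related components the build actually contains:
> >>>>>>>
> >>>>>>>   $ ompi_info | grep -i pmi
> >>>>>>>
> >>>>>>> which lists the PMI/PMIx components that were built into this
> >>>>>>> installation.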
> >>>>>>>
> >>>>>>> Any ideas what could be wrong? Do you need any additional
> information?
> >>>>>>>
> >>>>>>> Prentice
> >>>>>>>
>
>
