On 07/05/2020 11:42, John Hearns via users wrote:
> Patrick, I am sure that you have asked Dell for support on this issue?

No, I didn't :-(. I was only accessing these servers for a short time to run
a benchmark, and the workaround was enough. I'm not using Slurm here but a
local scheduler (OAR), so the problem was not critical for my future work.
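
For reference, the workaround amounted to a batch script along these lines
(the binary name and the node/task counts are illustrative, not the real
benchmark):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# Launching through srun failed (MPI_Init aborted), so fall back to
# mpirun. mpirun only detected one slot per node, hence --oversubscribe;
# --bind-to core keeps the ranks from piling up on a single core.
mpirun --oversubscribe --bind-to core ./my_bench

Adding --report-bindings to the mpirun line makes it print each rank's
binding, which is an easy way to check that every process really lands on
its own core.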
Patrick

> On Sun, 26 Apr 2020 at 18:09, Patrick Bégou via users
> <users@lists.open-mpi.org> wrote:
>
>     I also have this problem on servers I'm benchmarking at Dell's lab
>     with OpenMPI 4.0.3. I tried a new build of OpenMPI with
>     "--with-pmi2": no change.
>     Finally, my workaround in the Slurm script was to launch my code
>     with mpirun. As mpirun was only finding one slot per node, I used
>     "--oversubscribe --bind-to core" and checked that every process was
>     bound to a separate core. It worked, but do not ask me why :-)
>
>     Patrick
>
>     On 24/04/2020 20:28, Riebs, Andy via users wrote:
>     > Prentice, have you tried something trivial, like "srun -N3
>     > hostname", to rule out non-OMPI problems?
>     >
>     > Andy
>     >
>     > -----Original Message-----
>     > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf
>     > Of Prentice Bisbal via users
>     > Sent: Friday, April 24, 2020 2:19 PM
>     > To: Ralph Castain <r...@open-mpi.org>; Open MPI Users
>     > <users@lists.open-mpi.org>
>     > Cc: Prentice Bisbal <pbis...@pppl.gov>
>     > Subject: Re: [OMPI users] [External] Re: Can't start jobs with srun.
>     >
>     > Okay. I've got Slurm built with PMIx support:
>     >
>     > $ srun --mpi=list
>     > srun: MPI types are...
>     > srun: none
>     > srun: pmix_v3
>     > srun: pmi2
>     > srun: openmpi
>     > srun: pmix
>     >
>     > But now when I try to launch a job with srun, the job appears to
>     > be running but doesn't do anything; it just hangs in the running
>     > state. Any ideas what could be wrong, or how to debug this?
>     >
>     > I'm asking around on the Slurm mailing list, too.
>     >
>     > Prentice
>     >
>     > On 4/23/20 3:03 PM, Ralph Castain wrote:
>     >> You can trust the --mpi=list output. The problem is likely that
>     >> OMPI wasn't configured --with-pmi2.
>     >>
>     >>> On Apr 23, 2020, at 11:59 AM, Prentice Bisbal via users
>     >>> <users@lists.open-mpi.org> wrote:
>     >>>
>     >>> --mpi=list shows pmi2 and openmpi as valid values, but if I set
>     >>> --mpi= to either of them, my job still fails. Why is that? Can I
>     >>> not trust the output of --mpi=list?
>     >>>
>     >>> Prentice
>     >>>
>     >>> On 4/23/20 10:43 AM, Ralph Castain via users wrote:
>     >>>> No, but you do have to explicitly build OMPI with non-PMIx
>     >>>> support if that is what you are going to use. In this case, you
>     >>>> need to configure OMPI
>     >>>> --with-pmi2=<path-to-the-pmi2-installation>
>     >>>>
>     >>>> You can leave off the path (i.e., just "--with-pmi2") if Slurm
>     >>>> was installed in a standard location, as we should find it
>     >>>> there.
>     >>>>
>     >>>>> On Apr 23, 2020, at 7:39 AM, Prentice Bisbal via users
>     >>>>> <users@lists.open-mpi.org> wrote:
>     >>>>>
>     >>>>> It looks like it was built with PMI2, but not PMIx:
>     >>>>>
>     >>>>> $ srun --mpi=list
>     >>>>> srun: MPI types are...
>     >>>>> srun: none
>     >>>>> srun: pmi2
>     >>>>> srun: openmpi
>     >>>>>
>     >>>>> I did launch the job with srun --mpi=pmi2 ....
>     >>>>>
>     >>>>> Does OpenMPI 4 need PMIx specifically?
>     >>>>>
>     >>>>> On 4/23/20 10:23 AM, Ralph Castain via users wrote:
>     >>>>>> Is Slurm built with PMIx support? Did you tell srun to use it?
>     >>>>>>
>     >>>>>>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users
>     >>>>>>> <users@lists.open-mpi.org> wrote:
>     >>>>>>>
>     >>>>>>> I'm using OpenMPI 4.0.3 with Slurm 19.05.5. I'm testing the
>     >>>>>>> software with a very simple "hello, world" MPI program that
>     >>>>>>> I've used reliably for years. When I submit the job through
>     >>>>>>> Slurm and use srun to launch it, I get these errors:
>     >>>>>>>
>     >>>>>>> *** An error occurred in MPI_Init
>     >>>>>>> *** on a NULL communicator
>     >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>     >>>>>>> will now abort,
>     >>>>>>> *** and potentially your MPI job)
>     >>>>>>> [dawson029.pppl.gov:26070] Local abort before MPI_INIT
>     >>>>>>> completed completed successfully, but am not able to
>     >>>>>>> aggregate error messages, and not able to guarantee that all
>     >>>>>>> other processes were killed!
>     >>>>>>> *** An error occurred in MPI_Init
>     >>>>>>> *** on a NULL communicator
>     >>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>     >>>>>>> will now abort,
>     >>>>>>> *** and potentially your MPI job)
>     >>>>>>> [dawson029.pppl.gov:26076] Local abort before MPI_INIT
>     >>>>>>> completed completed successfully, but am not able to
>     >>>>>>> aggregate error messages, and not able to guarantee that all
>     >>>>>>> other processes were killed!
>     >>>>>>>
>     >>>>>>> If I run the same job but use mpiexec or mpirun instead of
>     >>>>>>> srun, the jobs run just fine. I checked ompi_info to make
>     >>>>>>> sure OpenMPI was compiled with Slurm support:
>     >>>>>>>
>     >>>>>>> $ ompi_info | grep slurm
>     >>>>>>>   Configure command line: '--prefix=/usr/pppl/intel/2019-pkgs/openmpi-4.0.3'
>     >>>>>>>     '--disable-silent-rules' '--enable-shared'
>     >>>>>>>     '--with-pmix=internal' '--with-slurm' '--with-psm'
>     >>>>>>>   MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
>     >>>>>>>   MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>     >>>>>>>   MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
>     >>>>>>>   MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
>     >>>>>>>
>     >>>>>>> Any ideas what could be wrong? Do you need any additional
>     >>>>>>> information?
>     >>>>>>>
>     >>>>>>> Prentice
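
For anyone who finds this thread later: written out as a concrete recipe,
Ralph's suggestion above would look something like this (the install prefix,
the PMI2 path, and the binary name are illustrative; point --with-pmi2 at
the directory that actually contains Slurm's pmi2.h, or leave off the path
if Slurm is installed in a standard location):

# rebuild Open MPI against Slurm's PMI-2 library
$ ./configure --prefix=/opt/openmpi-4.0.3 --with-slurm --with-pmi2=/usr
$ make -j8 && make install

# then tell srun which PMI plugin to use at launch time
$ srun --mpi=pmi2 -N3 ./hello_mpi

If the job then hangs instead of aborting, as Prentice saw, Andy's sanity
check ("srun -N3 hostname") is still the right first step, since a hang
there points at Slurm itself rather than at Open MPI.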