Ralph,

My guess is that ptl.c comes from the PSM lib ...

Cheers,

Gilles

On Thursday, September 29, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Spawn definitely does not work with srun. I don't recognize the name of
> the file that segfaulted - what is "ptl.c"? Is that in your manager program?
>
> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Hi,
>
> I do not expect spawn to work with direct launch (e.g. srun).
>
> Do you have PSM (e.g. InfiniPath) hardware? That could be linked to the
> failure.
>
> Can you please try
>
>     mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts ./manager 1
>
> and see if it helps?
>
> Note that if you have the possibility, I suggest you first try that
> without Slurm, and then within a Slurm job.
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, juraj2...@gmail.com <juraj2...@gmail.com> wrote:
>
>> Hello,
>>
>> I am using MPI_Comm_spawn to dynamically create new processes from a
>> single manager process. Everything works fine when all the processes are
>> running on the same node, but imposing the restriction to run only a
>> single process per node does not work. Below are the errors produced
>> during a multi-node interactive session and a multi-node sbatch job.
>>
>> The system I am using is: Linux version 3.10.0-229.el7.x86_64
>> (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120
>> (Red Hat 4.8.2-16) (GCC)).
>> I am using Open MPI 2.0.1.
>> Slurm is version 15.08.9.
>>
>> What is preventing my jobs from spawning on multiple nodes? Does Slurm
>> require some additional configuration to allow it? Or is it an issue on
>> the MPI side - does it need to be compiled with some special flag (I
>> have compiled it with --enable-mpi-fortran=all --with-pmi)?
>>
>> The code I am launching is here: https://github.com/goghino/dynamicMPI
>>
>> The manager tries to launch one new process (./manager 1). This is the
>> error produced when requesting each process to be located on a different
>> node (interactive session):
>>
>> $ salloc -N 2
>> $ cat my_hosts
>> icsnode37
>> icsnode38
>> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode37
>> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> [icsnode37:12614] *** Process received signal ***
>> [icsnode37:12614] Signal: Aborted (6)
>> [icsnode37:12614] Signal code: (-6)
>> [icsnode38:32443] *** Process received signal ***
>> [icsnode38:32443] Signal: Aborted (6)
>> [icsnode38:32443] Signal code: (-6)
>>
>> The same example as above via sbatch job submission:
>>
>> $ cat job.sbatch
>> #!/bin/bash
>>
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=1
>>
>> module load openmpi/2.0.1
>> srun -n 1 -N 1 ./manager 1
>>
>> $ cat output.o
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode39
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
>> [icsnode39:9692] *** reported by process [1007812608,0]
>> [icsnode39:9692] *** on communicator MPI_COMM_SELF
>> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
>> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [icsnode39:9692] *** and potentially your MPI job)
>> In: PMI_Abort(50, N/A)
>> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT
>> 2016-09-26T16:48:20 ***
>> srun: error: icsnode39: task 0: Exited with exit code 50
>>
>> Thanks for any feedback!
>>
>> Best regards,
>> Juraj
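[Editor's note] The MCA settings Gilles suggests on the mpirun command line
can equivalently be exported as environment variables, which Open MPI reads
as OMPI_MCA_<param> - convenient inside an sbatch script. A sketch (the
hostfile name and manager invocation are taken from the thread):

```shell
# Equivalent to: mpirun --mca pml ob1 --mca btl tcp,sm,self ...
# Forces the ob1 point-to-point layer and TCP/shared-memory transports,
# bypassing the PSM-based path that triggers the ptl.c assertion.
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,sm,self
mpirun -np 1 --hostfile my_hosts ./manager 1
```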
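[Editor's note] For readers landing on this thread, a minimal manager along
the lines Juraj describes might look like the sketch below. This is not the
code from the linked repository; the worker binary name ./worker, the
argument handling, and the log messages are assumptions for illustration:

```c
/* manager.c - hypothetical minimal MPI_Comm_spawn manager.
 * Build: mpicc manager.c -o manager
 * Run:   mpirun -np 1 ./manager 1
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Number of workers to spawn, taken from the command line (./manager 1). */
    int nspawn = (argc > 1) ? atoi(argv[1]) : 1;

    char node[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(node, &len);
    printf("[manager] running on node %s\n", node);

    /* Spawn nspawn copies of ./worker. On the worker side, the spawned
     * processes retrieve this intercommunicator via MPI_Comm_get_parent(). */
    MPI_Comm intercomm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                   0 /* root */, MPI_COMM_SELF, &intercomm,
                   MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

Note that spawn requires the Open MPI runtime to be in control of process
placement, which is why the thread's advice is to launch via mpirun rather
than directly with srun.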
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users