Ralph,

My guess is that ptl.c comes from the PSM library ...

Cheers,

Gilles

On Thursday, September 29, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Spawn definitely does not work with srun. I don’t recognize the name of
> the file that segfaulted - what is “ptl.c”? Is that in your manager program?
>
>
> On Sep 29, 2016, at 6:06 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Hi,
>
> I do not expect spawn to work with direct launch (e.g. srun): with direct
> launch there is no Open MPI runtime (mpirun/orted) on the nodes that could
> start the newly spawned processes.
>
> Do you have PSM (e.g. InfiniPath) hardware? That could be linked to the
> failure.
>
> Can you please try
>
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 --hostfile my_hosts
> ./manager 1
>
> and see if it helps? Forcing the ob1 PML with the tcp/sm/self BTLs
> bypasses PSM, so this should tell us whether PSM is involved in the
> failure.
>
> Note that if you have the possibility, I suggest you first try that
> without Slurm, and then within a Slurm job; a sketch of the Slurm variant
> follows.
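>
> Here is what the batch version could look like (a sketch only; it mirrors
> the job script quoted below in this thread, with mpirun replacing srun and
> the MCA settings from above added):
>
> #!/bin/bash
>
> #SBATCH --nodes=2
> #SBATCH --ntasks-per-node=1
>
> module load openmpi/2.0.1
> # launch via mpirun rather than srun so the Open MPI runtime is available,
> # and force ob1/tcp to rule out PSM
> mpirun --mca pml ob1 --mca btl tcp,sm,self -np 1 ./manager 1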
>
> Cheers,
>
> Gilles
>
> On Thursday, September 29, 2016, juraj2...@gmail.com
> <juraj2...@gmail.com> wrote:
>
>> Hello,
>>
>> I am using MPI_Comm_spawn to dynamically create new processes from a
>> single manager process. Everything works fine when all the processes are
>> running on the same node, but imposing the restriction of a single process
>> per node does not work. Below are the errors produced during a multinode
>> interactive session and a multinode sbatch job.
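>>
>> For reference, the manager side follows the standard spawn pattern.
>> Here is a minimal sketch (this is illustrative, not the exact code from
>> the repository linked below; the "./worker" binary name is made up):
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     MPI_Init(&argc, &argv);
>>
>>     int nworkers = (argc > 1) ? atoi(argv[1]) : 1; /* e.g. ./manager 1 */
>>     MPI_Comm intercomm;
>>     int errcodes[nworkers];
>>
>>     /* spawn nworkers copies of ./worker; the resulting
>>      * intercommunicator links the manager (local group) with the
>>      * spawned workers (remote group) */
>>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, nworkers, MPI_INFO_NULL,
>>                    0, MPI_COMM_SELF, &intercomm, errcodes);
>>
>>     int nremote;
>>     MPI_Comm_remote_size(intercomm, &nremote);
>>     printf("[manager] spawned %d worker(s)\n", nremote);
>>
>>     MPI_Comm_disconnect(&intercomm);
>>     MPI_Finalize();
>>     return 0;
>> }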
>>
>> The system I am using is Linux 3.10.0-229.el7.x86_64
>> (buil...@kbuilder.dev.centos.org), built with gcc 4.8.2 20140120
>> (Red Hat 4.8.2-16). I am using Open MPI 2.0.1 and Slurm 15.08.9.
>>
>> What is preventing my jobs from spawning on multiple nodes? Does Slurm
>> require some additional configuration to allow it? Or is it an issue on
>> the MPI side: does Open MPI need to be compiled with some special flag (I
>> compiled it with --enable-mpi-fortran=all --with-pmi)?
>>
>> The code I am launching is here: https://github.com/goghino/dynamicMPI
>>
>> The manager tries to launch one new process (./manager 1). This is the
>> error produced when requesting each process to be placed on a different
>> node (interactive session):
>> $ salloc -N 2
>> $ cat my_hosts
>> icsnode37
>> icsnode38
>> $ mpirun -np 1 -npernode 1 --hostfile my_hosts ./manager 1
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode37
>> icsnode37.12614Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> icsnode38.32443Assertion failure at ptl.c:183: epaddr == ((void *)0)
>> [icsnode37:12614] *** Process received signal ***
>> [icsnode37:12614] Signal: Aborted (6)
>> [icsnode37:12614] Signal code:  (-6)
>> [icsnode38:32443] *** Process received signal ***
>> [icsnode38:32443] Signal: Aborted (6)
>> [icsnode38:32443] Signal code:  (-6)
>>
>> The same example as above via sbatch job submission:
>> $ cat job.sbatch
>> #!/bin/bash
>>
>> #SBATCH --nodes=2
>> #SBATCH --ntasks-per-node=1
>>
>> module load openmpi/2.0.1
>> srun -n 1 -N 1 ./manager 1
>>
>> $ cat output.o
>> [manager]I'm running MPI 3.1
>> [manager]Runing on node icsnode39
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> [icsnode39:9692] *** An error occurred in MPI_Comm_spawn
>> [icsnode39:9692] *** reported by process [1007812608,0]
>> [icsnode39:9692] *** on communicator MPI_COMM_SELF
>> [icsnode39:9692] *** MPI_ERR_SPAWN: could not spawn processes
>> [icsnode39:9692] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [icsnode39:9692] ***    and potentially your MPI job)
>> In: PMI_Abort(50, N/A)
>> slurmstepd: *** STEP 15378.0 ON icsnode39 CANCELLED AT
>> 2016-09-26T16:48:20 ***
>> srun: error: icsnode39: task 0: Exited with exit code 50
>>
>> Thanks for any feedback!
>>
>> Best regards,
>> Juraj
>>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
