Try running the attached dynamic-spawn example. If that works, then the hang
likely has something to do with how R operates.

Attachment: simple_spawn.c
Description: Binary data



> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> 
> wrote:
> 
> Hi,
> 
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A 
> simple R script, which starts a few tasks, hangs at the end on disconnect. 
> Here is the script:
> 
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
> 
> And here is how I run it:
> 
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R --slave < mk.R
> 
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned 
> by R dynamically inside the script. So I ran into a number of issues here:
> 
> 1. With HPCX it seems that dynamic spawning of ranks is not supported, so 
> I had to turn off all of yalla/mxm/hcoll:
> 
> --------------------------------------------------------------------------
> Your application has invoked an MPI function that is not supported in
> this environment.
> 
>   MPI function: MPI_Comm_spawn
>   Reason:       the Yalla (MXM) PML does not support MPI dynamic process functionality
> --------------------------------------------------------------------------
> 
> 2. When I do that, the program does create a 'cluster' and starts the ranks, 
> but hangs in PMIx at MPI_Comm_disconnect. Here is the top of the trace from gdb:
> 
> #0  0x00007f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at client/pmix_client_connect.c:232
> #2  0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at ext2x_client.c:1432
> #3  0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
> #4  0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at pcomm_disconnect.c:67
> #5  0x00007f66a16799e9 in mpi_comm_disconnect () from /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
> #6  0x00007f66b2563de5 in do_dotcall () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #7  0x00007f66b25a207b in bcEval () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #8  0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #9  0x00007f66b25b2c62 in R_execClosure () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> 
> Might this also be related to the dynamic rank creation in R?
> 
> Thanks!
> 
> Marcin
> 
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
