Just out of curiosity, would using Rmpi and/or doMPI help in any way?

-- bennet

On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
> Thanks, Ralph!
>
> Your code finishes normally, so I guess the cause might lie in R.
> Running the R code with -mca pmix_base_verbose 1, I see that each rank calls
> ext2x:client disconnect twice (each PID prints the line twice):
>
> [...]
> 3 slaves are spawned successfully. 0 failed.
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
>
> In your example it is called only once per process.
>
> Do you have any suspicion where the second call comes from? Might this be
> the reason for the hang?
>
> Thanks!
>
> Marcin
>
>
> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>
> Try running the attached example dynamic code - if that works, then it
> likely is something to do with how R operates.
>
>
> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
> <marcin.krotkiew...@gmail.com> wrote:
>
> Hi,
>
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
> simple R script, which starts a few tasks, hangs at the end on disconnect.
> Here is the script:
>
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
>
> And here is how I run it:
>
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R --slave < mk.R
>
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
> by R dynamically inside the script. So I ran into a number of issues here:
>
> 1. With HPCX, it seems that dynamic spawning of ranks is not supported, hence
> I had to turn off all of yalla/mxm/hcoll:
>
> --------------------------------------------------------------------------
> Your application has invoked an MPI function that is not supported in
> this environment.
>
>   MPI function: MPI_Comm_spawn
>   Reason:       the Yalla (MXM) PML does not support MPI dynamic process
>                 functionality
> --------------------------------------------------------------------------
>
> 2. When I do that, the program does create a 'cluster' and starts the ranks,
> but hangs in PMIx at MPI_Comm_disconnect. Here is the top of the trace from gdb:
>
> #0 0x00007f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x00007f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at client/pmix_client_connect.c:232
> #2 0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at ext2x_client.c:1432
> #3 0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
> #4 0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at pcomm_disconnect.c:67
> #5 0x00007f66a16799e9 in mpi_comm_disconnect () from /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
> #6 0x00007f66b2563de5 in do_dotcall () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #7 0x00007f66b25a207b in bcEval () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #8 0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #9 0x00007f66b25b2c62 in R_execClosure () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>
> Might this also be related to the dynamic rank creation in R?
>
> Thanks!
>
> Marcin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
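
For anyone wanting to repeat the check quoted above (each PID printing "ext2x:client disconnect" twice) on a longer log, here is a minimal sketch in Python. It is not part of the original thread. The function name count_disconnects and the SAMPLE_LOG variable are made up for illustration, and the sample text is simply the verbose output quoted in this thread. It assumes the "[host:pid] ext2x:client disconnect" line format shown there and nothing else about the PMIx verbose output.

```python
import re
from collections import Counter

# Assumed line format, taken from the output quoted in this thread:
#   [localhost.localdomain:11659] ext2x:client disconnect
DISCONNECT_RE = re.compile(r"\[[^\]:]+:(?P<pid>\d+)\]\s+ext2x:client disconnect")


def count_disconnects(log_text):
    """Return a Counter mapping PID -> number of disconnect lines seen."""
    return Counter(int(m.group("pid")) for m in DISCONNECT_RE.finditer(log_text))


# Verbose output as quoted in the thread (hypothetical variable name).
SAMPLE_LOG = """\
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
"""

if __name__ == "__main__":
    # Print one "pid count" pair per process, sorted by PID.
    for pid, n in sorted(count_disconnects(SAMPLE_LOG).items()):
        print(pid, n)
```

Feeding it the captured stderr of the mpirun command would show whether every process, including the parent, disconnects the same number of times. It only counts log lines and says nothing about where the second call originates.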