Hi All,

This looks very much like what I reported a couple of weeks ago with Rmpi and doMPI — the trace looks the same. But as far as I could see, doMPI does exactly what simple_spawn.c does — use MPI_Comm_spawn to create the workers and then MPI_Comm_disconnect them when you call closeCluster, and it’s here that it hung.
Ralph suggested trying master, but I haven’t had a chance to try this yet. I’ll try it today and see if it works for me now.

Cheers,
Ben

> On 5 Jun 2018, at 6:28 am, r...@open-mpi.org wrote:
>
> Yes, that does sound like a bug - the #connects must equal the #disconnects.
>
>> On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>
>> Huh. This code also runs, but it also only displays 4 connect / disconnect
>> messages. I should add that the test R script shows 4 connect, but 8
>> disconnect messages. Looks like a bug to me, but where? I guess we will try
>> to contact R forums and ask there.
>>
>> Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this case
>> I get a warning about fork being used:
>>
>> --------------------------------------------------------------------------
>> A process has executed an operation involving a call to the
>> "fork()" system call to create a child process. Open MPI is currently
>> operating in a condition that could result in memory corruption or
>> other system errors; your job may hang, crash, or produce silent
>> data corruption. The use of fork() (or system() or other calls that
>> create child processes) is strongly discouraged.
>>
>> The process that invoked fork was:
>>
>>   Local host: [[36000,2],1] (PID 23617)
>>
>> If you are *absolutely sure* that your application will successfully
>> and correctly survive a call to fork(), you may disable this warning
>> by setting the mpi_warn_on_fork MCA parameter to 0.
>> --------------------------------------------------------------------------
>>
>> And the process hangs as well - no change.
>>
>> Marcin
>>
>> On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:
>>> It might call disconnect more than once if it creates multiple
>>> communicators.
>>> Here’s another test case for that behavior:
>>>
>>>> On Jun 4, 2018, at 7:08 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>>
>>>> Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
>>>>
>>>> -- bennet
>>>>
>>>> On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>> Thanks, Ralph!
>>>>>
>>>>> Your code finishes normally, so I guess the reason might lie in R.
>>>>> Running the R code with -mca pmix_base_verbose 1 I see that each rank
>>>>> calls ext2x:client disconnect twice (each PID prints the line twice):
>>>>>
>>>>> [...]
>>>>> 3 slaves are spawned successfully. 0 failed.
>>>>> [localhost.localdomain:11659] ext2x:client disconnect
>>>>> [localhost.localdomain:11661] ext2x:client disconnect
>>>>> [localhost.localdomain:11658] ext2x:client disconnect
>>>>> [localhost.localdomain:11646] ext2x:client disconnect
>>>>> [localhost.localdomain:11658] ext2x:client disconnect
>>>>> [localhost.localdomain:11659] ext2x:client disconnect
>>>>> [localhost.localdomain:11661] ext2x:client disconnect
>>>>> [localhost.localdomain:11646] ext2x:client disconnect
>>>>>
>>>>> In your example it's only called once per process.
>>>>>
>>>>> Do you have any suspicion where the second call comes from? Might this be
>>>>> the reason for the hang?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Marcin
>>>>>
>>>>> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>>>>>
>>>>> Try running the attached example dynamic code - if that works, then it
>>>>> likely is something to do with how R operates.
>>>>>
>>>>> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
>>>>> <marcin.krotkiew...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1.
>>>>> A simple R script, which starts a few tasks, hangs at the end on
>>>>> disconnect. Here is the script:
>>>>>
>>>>> library(parallel)
>>>>> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
>>>>> myCluster <- makeCluster(numWorkers, type = "MPI")
>>>>> stopCluster(myCluster)
>>>>>
>>>>> And here is how I run it:
>>>>>
>>>>> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll \
>>>>>     R --slave < mk.R
>>>>>
>>>>> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are
>>>>> spawned by R dynamically inside the script. So I ran into a number of
>>>>> issues here:
>>>>>
>>>>> 1. With HPCX it seems that dynamic starting of ranks is not supported,
>>>>> hence I had to turn off all of yalla/mxm/hcoll:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Your application has invoked an MPI function that is not supported in
>>>>> this environment.
>>>>>
>>>>>   MPI function: MPI_Comm_spawn
>>>>>   Reason: the Yalla (MXM) PML does not support MPI dynamic process
>>>>>   functionality
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> 2. When I do that, the program does create a 'cluster' and starts the
>>>>> ranks, but hangs in PMIx at MPI_Comm_disconnect.
>>>>> Here is the top of the trace from gdb:
>>>>>
>>>>> #0  0x00007f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>> #1  0x00007f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
>>>>>     nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at client/pmix_client_connect.c:232
>>>>> #2  0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at ext2x_client.c:1432
>>>>> #3  0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
>>>>> #4  0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at pcomm_disconnect.c:67
>>>>> #5  0x00007f66a16799e9 in mpi_comm_disconnect () from /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
>>>>> #6  0x00007f66b2563de5 in do_dotcall () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #7  0x00007f66b25a207b in bcEval () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #8  0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #9  0x00007f66b25b2c62 in R_execClosure () from /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>>
>>>>> Might this also be related to the dynamic rank creation in R?
>>>>>
>>>>> Thanks!
>>>>> Marcin
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users@lists.open-mpi.org
>>>>> https://lists.open-mpi.org/mailman/listinfo/users