Hi All,

This looks very much like what I reported a couple of weeks ago with Rmpi and 
doMPI; the trace looks the same. As far as I could tell, doMPI does exactly 
what simple_spawn.c does: it uses MPI_Comm_spawn to create the workers and 
then MPI_Comm_disconnect to tear them down when you call closeCluster, and 
that is where it hung.
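
For anyone following along, here is a minimal sketch of that pattern. This is 
my own reconstruction of the spawn/disconnect shape, not the actual 
simple_spawn.c:

#include <mpi.h>

/* Parent spawns workers over an intercommunicator, then disconnects it;
 * each worker disconnects from its parent.  In our runs the disconnect
 * step is where the hang shows up. */
int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        /* parent: spawn 4 copies of this binary */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* spawned worker: disconnect from the parent */
        MPI_Comm_disconnect(&parent);
    }
    MPI_Finalize();
    return 0;
}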

Ralph suggested trying master, but I haven’t had a chance to try this yet. I’ll 
try it today and see if it works for me now.

Cheers,
Ben


> On 5 Jun 2018, at 6:28 am, r...@open-mpi.org wrote:
> 
> Yes, that does sound like a bug - the #connects must equal the #disconnects.
> 
> 
>> On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>> 
>> Huh. This code also runs, but it too only displays 4 connect / disconnect 
>> messages. I should add that the test R script shows 4 connect but 8 
>> disconnect messages. Looks like a bug to me, but where? I guess we will 
>> ask on the R forums.
>> 
>> Bennet: I tried doMPI with startMPIcluster / closeCluster. In this case 
>> I get a warning about fork() being used:
>> 
>> --------------------------------------------------------------------------
>> A process has executed an operation involving a call to the
>> "fork()" system call to create a child process.  Open MPI is currently
>> operating in a condition that could result in memory corruption or
>> other system errors; your job may hang, crash, or produce silent
>> data corruption.  The use of fork() (or system() or other calls that
>> create child processes) is strongly discouraged.
>> 
>> The process that invoked fork was:
>> 
>>   Local host:          [[36000,2],1] (PID 23617)
>> 
>> If you are *absolutely sure* that your application will successfully
>> and correctly survive a call to fork(), you may disable this warning
>> by setting the mpi_warn_on_fork MCA parameter to 0.
>> --------------------------------------------------------------------------
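>> 
>> (As the message suggests, the warning itself, though not the hang, can be 
>> silenced by setting that MCA parameter on the command line, e.g.:
>> 
>>   mpirun -np 1 -mca mpi_warn_on_fork 0 ... R --slave < mk.R
>> 
>> with "..." standing for the other -mca options from the original command.)
>> 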
>> And the process hangs as well - no change.
>> Marcin
>> 
>> 
>> On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:
>>> It might call disconnect more than once if it creates multiple 
>>> communicators. Here’s another test case for that behavior:
>>> 
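>>> (The attached test case is not included here. Below is a minimal sketch, 
>>> assuming the pattern is a spawn plus an intercommunicator merge, of how 
>>> one spawn can lead to two disconnect calls per process, which would match 
>>> the doubled "ext2x:client disconnect" lines:)
>>> 
>>> #include <mpi.h>
>>> 
>>> /* Spawning creates one intercommunicator; merging it creates a second
>>>  * communicator.  Disconnecting both means two disconnect calls per
>>>  * process. */
>>> int main(int argc, char **argv)
>>> {
>>>     MPI_Comm parent, inter, merged;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_get_parent(&parent);
>>>     if (parent == MPI_COMM_NULL) {
>>>         MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL,
>>>                        0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
>>>         MPI_Intercomm_merge(inter, 0, &merged);  /* communicator #2 */
>>>     } else {
>>>         inter = parent;
>>>         MPI_Intercomm_merge(inter, 1, &merged);
>>>     }
>>>     MPI_Comm_disconnect(&merged);  /* disconnect #1 */
>>>     MPI_Comm_disconnect(&inter);   /* disconnect #2 */
>>>     MPI_Finalize();
>>>     return 0;
>>> }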
>>> 
>>>> On Jun 4, 2018, at 7:08 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>> 
>>>> Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
>>>> 
>>>> -- bennet
>>>> 
>>>> 
>>>> On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>> Thanks, Ralph!
>>>>> 
>>>>> Your code finishes normally, so I guess the reason might lie in R.
>>>>> Running the R code with -mca pmix_base_verbose 1, I see that each rank
>>>>> calls ext2x:client disconnect twice (each PID prints the line twice):
>>>>> 
>>>>> [...]
>>>>>    3 slaves are spawned successfully. 0 failed.
>>>>> [localhost.localdomain:11659] ext2x:client disconnect
>>>>> [localhost.localdomain:11661] ext2x:client disconnect
>>>>> [localhost.localdomain:11658] ext2x:client disconnect
>>>>> [localhost.localdomain:11646] ext2x:client disconnect
>>>>> [localhost.localdomain:11658] ext2x:client disconnect
>>>>> [localhost.localdomain:11659] ext2x:client disconnect
>>>>> [localhost.localdomain:11661] ext2x:client disconnect
>>>>> [localhost.localdomain:11646] ext2x:client disconnect
>>>>> 
>>>>> In your example it's only called once per process.
>>>>> 
>>>>> Do you have any suspicion where the second call comes from? Might this be
>>>>> the reason for the hang?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Marcin
>>>>> 
>>>>> 
>>>>> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>>>>> 
>>>>> Try running the attached example dynamic code - if that works, then it
>>>>> likely is something to do with how R operates.
>>>>> 
>>>>> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have some problems running R + Rmpi with Open MPI 3.1.0 + PMIx 2.1.1. A
>>>>> simple R script, which starts a few tasks, hangs at the end on disconnect.
>>>>> Here is the script:
>>>>> 
>>>>> library(parallel)
>>>>> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
>>>>> myCluster <- makeCluster(numWorkers, type = "MPI")
>>>>> stopCluster(myCluster)
>>>>> 
>>>>> And here is how I run it:
>>>>> 
>>>>> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm \
>>>>>     -mca coll ^hcoll R --slave < mk.R
>>>>> 
>>>>> Notice the -np 1: this is apparently how you start Rmpi jobs; the ranks
>>>>> are spawned dynamically by R inside the script. I ran into a number of
>>>>> issues here:
>>>>> 
>>>>> 1. With HPCX it seems that dynamic spawning of ranks is not supported,
>>>>> so I had to turn off all of yalla/mxm/hcoll:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> Your application has invoked an MPI function that is not supported in
>>>>> this environment.
>>>>> 
>>>>>  MPI function: MPI_Comm_spawn
>>>>>  Reason:       the Yalla (MXM) PML does not support MPI dynamic process
>>>>> functionality
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> 2. When I do that, the program does create a 'cluster' and starts the
>>>>> ranks, but hangs in PMIx at MPI_Comm_disconnect. Here is the top of the
>>>>> trace from gdb:
>>>>> 
>>>>> #0  0x00007f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>>>> /lib64/libpthread.so.0
>>>>> #1  0x00007f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
>>>>> nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
>>>>> client/pmix_client_connect.c:232
>>>>> #2  0x00007f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
>>>>> ext2x_client.c:1432
>>>>> #3  0x00007f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
>>>>> dpm/dpm.c:596
>>>>> #4  0x00007f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
>>>>> pcomm_disconnect.c:67
>>>>> #5  0x00007f66a16799e9 in mpi_comm_disconnect () from
>>>>> /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
>>>>> #6  0x00007f66b2563de5 in do_dotcall () from
>>>>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #7  0x00007f66b25a207b in bcEval () from
>>>>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #8  0x00007f66b25b0fd0 in Rf_eval.localalias.34 () from
>>>>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> #9  0x00007f66b25b2c62 in R_execClosure () from
>>>>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>>>>> 
>>>>> Might this also be related to the dynamic rank creation in R?
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Marcin
>>>>> 
> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
