Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-05 Thread marcin.krotkiewski
Well, I tried with 3.0.1, and it also hangs. I guess we will try to 
write to the R community about this.


m


On 06/04/2018 11:42 PM, Ben Menadue wrote:

Hi All,

This looks very much like what I reported a couple of weeks ago with 
Rmpi and doMPI — the trace looks the same.  But as far as I could see, 
doMPI does exactly what simple_spawn.c does — use MPI_Comm_spawn to 
create the workers and then MPI_Comm_disconnect them when you call 
closeCluster, and it’s here that it hung.


Ralph suggested trying master, but I haven’t had a chance to try this 
yet. I’ll try it today and see if it works for me now.


Cheers,
Ben


On 5 Jun 2018, at 6:28 am, r...@open-mpi.org  
wrote:


Yes, that does sound like a bug - the #connects must equal the 
#disconnects.



On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:


huh. This code also runs, but it also only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connect, but 8 disconnect messages. Looks like a bug to me, but 
where? I guess we will try to contact R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In 
this case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:




On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
  wrote:

Thanks, Ralph!

Your code finishes normally, so I guess the reason might lie in R.
Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
  wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread Ben Menadue
Hi All,

This looks very much like what I reported a couple of weeks ago with Rmpi and 
doMPI — the trace looks the same.  But as far as I could see, doMPI does 
exactly what simple_spawn.c does — use MPI_Comm_spawn to create the workers and 
then MPI_Comm_disconnect them when you call closeCluster, and it’s here that it 
hung.
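
For reference, here is roughly the pattern I mean on the parent side. This is only a minimal sketch of the MPI calls as I understand them, not the actual simple_spawn.c or doMPI source, and "./worker" is just a placeholder executable name:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm workers;

    MPI_Init(&argc, &argv);

    /* Spawn the workers over an inter-communicator, roughly what
       makeCluster()/startMPIcluster() end up doing underneath. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);

    /* ... exchange work over the "workers" inter-communicator ... */

    /* One disconnect to match the one connect from the spawn; this is
       the step where closeCluster() hung for us. */
    MPI_Comm_disconnect(&workers);

    MPI_Finalize();
    return 0;
}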

Ralph suggested trying master, but I haven’t had a chance to try this yet. I’ll 
try it today and see if it works for me now.

Cheers,
Ben


> On 5 Jun 2018, at 6:28 am, r...@open-mpi.org wrote:
> 
> Yes, that does sound like a bug - the #connects must equal the #disconnects.
> 
> 
>> On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski wrote:
>> 
>> huh. This code also runs, but it also only displays 4 connect / disconnect 
>> messages. I should add that the test R script shows 4 connect, but 8 
>> disconnect messages. Looks like a bug to me, but where? I guess we will try 
>> to contact R forums and ask there.
>> Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this case 
>> I get a warning about fork being used:
>> 
>> --
>> A process has executed an operation involving a call to the
>> "fork()" system call to create a child process.  Open MPI is currently
>> operating in a condition that could result in memory corruption or
>> other system errors; your job may hang, crash, or produce silent
>> data corruption.  The use of fork() (or system() or other calls that
>> create child processes) is strongly discouraged.
>> 
>> The process that invoked fork was:
>> 
>>   Local host:  [[36000,2],1] (PID 23617)
>> 
>> If you are *absolutely sure* that your application will successfully
>> and correctly survive a call to fork(), you may disable this warning
>> by setting the mpi_warn_on_fork MCA parameter to 0.
>> --
>> And the process hangs as well - no change.
>> Marcin
>> 
>> 
>> On 06/04/2018 05:27 PM, r...@open-mpi.org  wrote:
>>> It might call disconnect more than once if it creates multiple 
>>> communicators. Here’s another test case for that behavior:
>>> 
>>> 
>>> 
>>> 
 On Jun 4, 2018, at 7:08 AM, Bennet Fauber  
  wrote:
 
 Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
 
 -- bennet
 
 
 On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
   wrote:
> Thanks, Ralph!
> 
> Your code finishes normally, so I guess the reason might lie in R.
> Running the R code with -mca pmix_base_verbose 1 I see that each rank
> calls
> ext2x:client disconnect twice (each PID prints the line twice)
> 
> [...]
>3 slaves are spawned successfully. 0 failed.
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
> 
> In your example it's only called once per process.
> 
> Do you have any suspicion where the second call comes from? Might this be
> the reason for the hang?
> 
> Thanks!
> 
> Marcin
> 
> 
> On 06/04/2018 03:16 PM, r...@open-mpi.org  
> wrote:
> 
> Try running the attached example dynamic code - if that works, then it
> likely is something to do with how R operates.
> 
> 
> 
> 
> 
> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
>   
> wrote:
> 
> Hi,
> 
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
> simple R script, which starts a few tasks, hangs at the end on disconnect.
> Here is the script:
> 
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
> 
> And here is how I run it:
> 
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll 
> ^hcoll R
> --slave < mk.R
> 
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are 
> spawned
> by R dynamically inside the script. So I ran into a number of issues here:
> 
> 1. with HPCX it seems that dynamic starting of ranks is not supported, 
> hence
> I had to turn off all of yalla/mxm/hcoll
> 
> --
> Your application has invoked an 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
Yes, that does sound like a bug - the #connects must equal the #disconnects.


> On Jun 4, 2018, at 1:17 PM, marcin.krotkiewski  
> wrote:
> 
> huh. This code also runs, but it also only displays 4 connect / disconnect 
> messages. I should add that the test R script shows 4 connect, but 8 
> disconnect messages. Looks like a bug to me, but where? I guess we will try 
> to contact R forums and ask there.
> Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this case I 
> get a warning about fork being used:
> 
> --
> A process has executed an operation involving a call to the
> "fork()" system call to create a child process.  Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your job may hang, crash, or produce silent
> data corruption.  The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
> 
> The process that invoked fork was:
> 
>   Local host:  [[36000,2],1] (PID 23617)
> 
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --
> And the process hangs as well - no change.
> Marcin
> 
> 
> On 06/04/2018 05:27 PM, r...@open-mpi.org  wrote:
>> It might call disconnect more than once if it creates multiple 
>> communicators. Here’s another test case for that behavior:
>> 
>> 
>> 
>> 
>> 
>>> On Jun 4, 2018, at 7:08 AM, Bennet Fauber  
>>>  wrote:
>>> 
>>> Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
>>> 
>>> -- bennet
>>> 
>>> 
>>> On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
>>>   wrote:
 Thanks, Ralph!
 
 Your code finishes normally, so I guess the reason might lie in R.
 Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
 ext2x:client disconnect twice (each PID prints the line twice)
 
 [...]
3 slaves are spawned successfully. 0 failed.
 [localhost.localdomain:11659] ext2x:client disconnect
 [localhost.localdomain:11661] ext2x:client disconnect
 [localhost.localdomain:11658] ext2x:client disconnect
 [localhost.localdomain:11646] ext2x:client disconnect
 [localhost.localdomain:11658] ext2x:client disconnect
 [localhost.localdomain:11659] ext2x:client disconnect
 [localhost.localdomain:11661] ext2x:client disconnect
 [localhost.localdomain:11646] ext2x:client disconnect
 
 In your example it's only called once per process.
 
 Do you have any suspicion where the second call comes from? Might this be
 the reason for the hang?
 
 Thanks!
 
 Marcin
 
 
 On 06/04/2018 03:16 PM, r...@open-mpi.org  wrote:
 
 Try running the attached example dynamic code - if that works, then it
 likely is something to do with how R operates.
 
 
 
 
 
 On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
   wrote:
 
 Hi,
 
 I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
 simple R script, which starts a few tasks, hangs at the end on disconnect.
 Here is the script:
 
 library(parallel)
 numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
 myCluster <- makeCluster(numWorkers, type = "MPI")
 stopCluster(myCluster)
 
 And here is how I run it:
 
 SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll 
 R
 --slave < mk.R
 
 Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are 
 spawned
 by R dynamically inside the script. So I ran into a number of issues here:
 
 1. with HPCX it seems that dynamic starting of ranks is not supported, 
 hence
 I had to turn off all of yalla/mxm/hcoll
 
 --
 Your application has invoked an MPI function that is not supported in
 this environment.
 
  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
 functionality
 --
 
 2. when I do that, the program does create a 'cluster' and starts the 
 ranks,
 but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:
 
 #0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 
  () from
 /lib64/libpthread.so.0
 #1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread Bennet Fauber
Marcin,

If you are interested, I can send you the R examples I use to test
things offline.

-- bennet



On Mon, Jun 4, 2018 at 4:17 PM, marcin.krotkiewski
 wrote:
> huh. This code also runs, but it also only displays 4 connect / disconnect
> messages. I should add that the test R script shows 4 connect, but 8
> disconnect messages. Looks like a bug to me, but where? I guess we will try
> to contact R forums and ask there.
>
> Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this case
> I get a warning about fork being used:
>
> --
> A process has executed an operation involving a call to the
> "fork()" system call to create a child process.  Open MPI is currently
> operating in a condition that could result in memory corruption or
> other system errors; your job may hang, crash, or produce silent
> data corruption.  The use of fork() (or system() or other calls that
> create child processes) is strongly discouraged.
>
> The process that invoked fork was:
>
>   Local host:  [[36000,2],1] (PID 23617)
>
> If you are *absolutely sure* that your application will successfully
> and correctly survive a call to fork(), you may disable this warning
> by setting the mpi_warn_on_fork MCA parameter to 0.
> --
>
> And the process hangs as well - no change.
>
> Marcin
>
>
>
> On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:
>
> It might call disconnect more than once if it creates multiple
> communicators. Here’s another test case for that behavior:
>
>
>
>
>
> On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:
>
> Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
>
> -- bennet
>
>
> On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
>  wrote:
>
> Thanks, Ralph!
>
> Your code finishes normally, so I guess the reason might lie in R.
> Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
> ext2x:client disconnect twice (each PID prints the line twice)
>
> [...]
>3 slaves are spawned successfully. 0 failed.
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
>
> In your example it's only called once per process.
>
> Do you have any suspicion where the second call comes from? Might this be
> the reason for the hang?
>
> Thanks!
>
> Marcin
>
>
> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>
> Try running the attached example dynamic code - if that works, then it
> likely is something to do with how R operates.
>
>
>
>
>
> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
>  wrote:
>
> Hi,
>
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
> simple R script, which starts a few tasks, hangs at the end on disconnect.
> Here is the script:
>
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
>
> And here is how I run it:
>
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
> --slave < mk.R
>
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
> by R dynamically inside the script. So I ran into a number of issues here:
>
> 1. with HPCX it seems that dynamic starting of ranks is not supported, hence
> I had to turn off all of yalla/mxm/hcoll
>
> --
> Your application has invoked an MPI function that is not supported in
> this environment.
>
>  MPI function: MPI_Comm_spawn
>  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
> functionality
> --
>
> 2. when I do that, the program does create a 'cluster' and starts the ranks,
> but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:
>
> #0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
> nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
> client/pmix_client_connect.c:232
> #2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
> ext2x_client.c:1432
> #3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
> dpm/dpm.c:596
> #4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
> pcomm_disconnect.c:67
> #5  0x7f66a16799e9 in mpi_comm_disconnect () from
> /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
> #6  0x7f66b2563de5 in 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski
huh. This code also runs, but it also only displays 4 connect / 
disconnect messages. I should add that the test R script shows 4 
connect, but 8 disconnect messages. Looks like a bug to me, but where? I 
guess we will try to contact R forums and ask there.


Bennet: I tried to use doMPI + startMPIcluster / closeCluster. In this 
case I get a warning about fork being used:


--
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:  [[36000,2],1] (PID 23617)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--

And the process hangs as well - no change.

Marcin



On 06/04/2018 05:27 PM, r...@open-mpi.org wrote:

It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:






On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:

Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
 wrote:

Thanks, Ralph!

Your code finishes normally, so I guess the reason might lie in R.
Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
ext2x:client disconnect twice (each PID prints the line twice)

[...]
3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this be
the reason for the hang?

Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it
likely is something to do with how R operates.





On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
 wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
simple R script, which starts a few tasks, hangs at the end on disconnect.
Here is the script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
by R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence
I had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_spawn
  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks,
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from
/cluster/software/R/3.5.0/lib64/R/lib/libR.so

Might this also be related to the 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
It might call disconnect more than once if it creates multiple communicators. 
Here’s another test case for that behavior:



intercomm_create.c
Description: Binary data
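
Roughly, the behavior that test exercises is this: if the code derives a second communicator from the spawn inter-communicator (for example by merging it) and later disconnects both, then each process calls disconnect twice for a single spawn. Below is a minimal sketch of that pattern; it is only an illustration under that assumption, not the attached code and not necessarily what Rmpi or doMPI actually do ("./worker" is a placeholder name):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm inter, merged;

    MPI_Init(&argc, &argv);

    /* Spawn workers over an inter-communicator. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);

    /* Derive a second communicator from the spawn inter-communicator
       (high = 0 on the parent side). */
    MPI_Intercomm_merge(inter, 0, &merged);

    /* ... use "merged" for collective work with the spawned ranks ... */

    MPI_Comm_disconnect(&merged);   /* first disconnect  */
    MPI_Comm_disconnect(&inter);    /* second disconnect */

    MPI_Finalize();
    return 0;
}

Two disconnects per process for one spawn would match the 4 connect / 8 disconnect counts reported earlier in the thread.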



> On Jun 4, 2018, at 7:08 AM, Bennet Fauber  wrote:
> 
> Just out of curiosity, but would using Rmpi and/or doMPI help in any way?
> 
> -- bennet
> 
> 
> On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
>  wrote:
>> Thanks, Ralph!
>> 
>> Your code finishes normally, so I guess the reason might lie in R.
>> Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
>> ext2x:client disconnect twice (each PID prints the line twice)
>> 
>> [...]
>>3 slaves are spawned successfully. 0 failed.
>> [localhost.localdomain:11659] ext2x:client disconnect
>> [localhost.localdomain:11661] ext2x:client disconnect
>> [localhost.localdomain:11658] ext2x:client disconnect
>> [localhost.localdomain:11646] ext2x:client disconnect
>> [localhost.localdomain:11658] ext2x:client disconnect
>> [localhost.localdomain:11659] ext2x:client disconnect
>> [localhost.localdomain:11661] ext2x:client disconnect
>> [localhost.localdomain:11646] ext2x:client disconnect
>> 
>> In your example it's only called once per process.
>> 
>> Do you have any suspicion where the second call comes from? Might this be
>> the reason for the hang?
>> 
>> Thanks!
>> 
>> Marcin
>> 
>> 
>> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>> 
>> Try running the attached example dynamic code - if that works, then it
>> likely is something to do with how R operates.
>> 
>> 
>> 
>> 
>> 
>> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
>>  wrote:
>> 
>> Hi,
>> 
>> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
>> simple R script, which starts a few tasks, hangs at the end on disconnect.
>> Here is the script:
>> 
>> library(parallel)
>> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
>> myCluster <- makeCluster(numWorkers, type = "MPI")
>> stopCluster(myCluster)
>> 
>> And here is how I run it:
>> 
>> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
>> --slave < mk.R
>> 
>> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
>> by R dynamically inside the script. So I ran into a number of issues here:
>> 
>> 1. with HPCX it seems that dynamic starting of ranks is not supported, hence
>> I had to turn off all of yalla/mxm/hcoll
>> 
>> --
>> Your application has invoked an MPI function that is not supported in
>> this environment.
>> 
>>  MPI function: MPI_Comm_spawn
>>  Reason:   the Yalla (MXM) PML does not support MPI dynamic process
>> functionality
>> --
>> 
>> 2. when I do that, the program does create a 'cluster' and starts the ranks,
>> but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:
>> 
>> #0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> /lib64/libpthread.so.0
>> #1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
>> nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
>> client/pmix_client_connect.c:232
>> #2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
>> ext2x_client.c:1432
>> #3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
>> dpm/dpm.c:596
>> #4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
>> pcomm_disconnect.c:67
>> #5  0x7f66a16799e9 in mpi_comm_disconnect () from
>> /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
>> #6  0x7f66b2563de5 in do_dotcall () from
>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>> #7  0x7f66b25a207b in bcEval () from
>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>> #8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from
>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>> #9  0x7f66b25b2c62 in R_execClosure () from
>> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>> 
>> Might this also be related to the dynamic rank creation in R?
>> 
>> Thanks!
>> 
>> Marcin
>> 

Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread Bennet Fauber
Just out of curiosity, but would using Rmpi and/or doMPI help in any way?

-- bennet


On Mon, Jun 4, 2018 at 10:00 AM, marcin.krotkiewski
 wrote:
> Thanks, Ralph!
>
> Your code finishes normally, so I guess the reason might lie in R.
> Running the R code with -mca pmix_base_verbose 1 I see that each rank calls
> ext2x:client disconnect twice (each PID prints the line twice)
>
> [...]
> 3 slaves are spawned successfully. 0 failed.
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
> [localhost.localdomain:11658] ext2x:client disconnect
> [localhost.localdomain:11659] ext2x:client disconnect
> [localhost.localdomain:11661] ext2x:client disconnect
> [localhost.localdomain:11646] ext2x:client disconnect
>
> In your example it's only called once per process.
>
> Do you have any suspicion where the second call comes from? Might this be
> the reason for the hang?
>
> Thanks!
>
> Marcin
>
>
> On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:
>
> Try running the attached example dynamic code - if that works, then it
> likely is something to do with how R operates.
>
>
>
>
>
> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski
>  wrote:
>
> Hi,
>
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A
> simple R script, which starts a few tasks, hangs at the end on disconnect.
> Here is the script:
>
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
>
> And here is how I run it:
>
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R
> --slave < mk.R
>
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned
> by R dynamically inside the script. So I ran into a number of issues here:
>
> 1. with HPCX it seems that dynamic starting of ranks is not supported, hence
> I had to turn off all of yalla/mxm/hcoll
>
> --
> Your application has invoked an MPI function that is not supported in
> this environment.
>
>   MPI function: MPI_Comm_spawn
>   Reason:   the Yalla (MXM) PML does not support MPI dynamic process
> functionality
> --
>
> 2. when I do that, the program does create a 'cluster' and starts the ranks,
> but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:
>
> #0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from
> /lib64/libpthread.so.0
> #1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20,
> nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at
> client/pmix_client_connect.c:232
> #2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at
> ext2x_client.c:1432
> #3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at
> dpm/dpm.c:596
> #4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at
> pcomm_disconnect.c:67
> #5  0x7f66a16799e9 in mpi_comm_disconnect () from
> /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
> #6  0x7f66b2563de5 in do_dotcall () from
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #7  0x7f66b25a207b in bcEval () from
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #9  0x7f66b25b2c62 in R_execClosure () from
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
>
> Might this also be related to the dynamic rank creation in R?
>
> Thanks!
>
> Marcin
>


Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread marcin.krotkiewski

Thanks, Ralph!

Your code finishes normally, so I guess the reason might lie in R.
Running the R code with -mca pmix_base_verbose 1 I see that each rank
calls ext2x:client disconnect twice (each PID prints the line twice)


[...]
    3 slaves are spawned successfully. 0 failed.
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect
[localhost.localdomain:11658] ext2x:client disconnect
[localhost.localdomain:11659] ext2x:client disconnect
[localhost.localdomain:11661] ext2x:client disconnect
[localhost.localdomain:11646] ext2x:client disconnect

In your example it's only called once per process.

Do you have any suspicion where the second call comes from? Might this 
be the reason for the hang?


Thanks!

Marcin


On 06/04/2018 03:16 PM, r...@open-mpi.org wrote:

Try running the attached example dynamic code - if that works, then it likely 
is something to do with how R operates.






On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski  
wrote:

Hi,

I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A simple 
R script, which starts a few tasks, hangs at the end on disconnect. Here is the 
script:

library(parallel)
numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
myCluster <- makeCluster(numWorkers, type = "MPI")
stopCluster(myCluster)

And here is how I run it:

SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R 
--slave < mk.R

Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned by 
R dynamically inside the script. So I ran into a number of issues here:

1. with HPCX it seems that dynamic starting of ranks is not supported, hence I 
had to turn off all of yalla/mxm/hcoll

--
Your application has invoked an MPI function that is not supported in
this environment.

   MPI function: MPI_Comm_spawn
   Reason:   the Yalla (MXM) PML does not support MPI dynamic process 
functionality
--

2. when I do that, the program does create a 'cluster' and starts the ranks, 
but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:

#0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/libpthread.so.0
#1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
client/pmix_client_connect.c:232
#2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
ext2x_client.c:1432
#3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at dpm/dpm.c:596
#4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
pcomm_disconnect.c:67
#5  0x7f66a16799e9 in mpi_comm_disconnect () from 
/cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
#6  0x7f66b2563de5 in do_dotcall () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#7  0x7f66b25a207b in bcEval () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so
#9  0x7f66b25b2c62 in R_execClosure () from 
/cluster/software/R/3.5.0/lib64/R/lib/libR.so

Might this also be related to the dynamic rank creation in R?

Thanks!

Marcin


Re: [OMPI users] A hang in Rmpi at PMIx_Disconnect

2018-06-04 Thread r...@open-mpi.org
Try running the attached example dynamic code - if that works, then it likely 
is something to do with how R operates.



simple_spawn.c
Description: Binary data
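
As a rough sketch of the worker side of such a spawn test (an assumption about its general shape, not the attached code): the spawned process retrieves the parent inter-communicator and disconnects from it before finalizing:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* ... communicate with the parent over "parent" ... */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}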



> On Jun 4, 2018, at 3:43 AM, marcin.krotkiewski  
> wrote:
> 
> Hi,
> 
> I have some problems running R + Rmpi with OpenMPI 3.1.0 + PMIx 2.1.1. A 
> simple R script, which starts a few tasks, hangs at the end on disconnect.
> Here is the script:
> 
> library(parallel)
> numWorkers <- as.numeric(Sys.getenv("SLURM_NTASKS")) - 1
> myCluster <- makeCluster(numWorkers, type = "MPI")
> stopCluster(myCluster)
> 
> And here is how I run it:
> 
> SLURM_NTASKS=5 mpirun -np 1 -mca pml ^yalla -mca mtl ^mxm -mca coll ^hcoll R 
> --slave < mk.R
> 
> Notice -np 1 - this is apparently how you start Rmpi jobs: ranks are spawned 
> by R dynamically inside the script. So I ran into a number of issues here:
> 
> 1. with HPCX it seems that dynamic starting of ranks is not supported, hence 
> I had to turn off all of yalla/mxm/hcoll
> 
> --
> Your application has invoked an MPI function that is not supported in
> this environment.
> 
>   MPI function: MPI_Comm_spawn
>   Reason:   the Yalla (MXM) PML does not support MPI dynamic process 
> functionality
> --
> 
> 2. when I do that, the program does create a 'cluster' and starts the ranks, 
> but hangs in PMIx at MPI Disconnect. Here is the top of the trace from gdb:
> 
> #0  0x7f66b1e1e995 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7f669eaeba5b in PMIx_Disconnect (procs=procs@entry=0x2e25d20, 
> nprocs=nprocs@entry=10, info=info@entry=0x0, ninfo=ninfo@entry=0) at 
> client/pmix_client_connect.c:232
> #2  0x7f669ed6239c in ext2x_disconnect (procs=0x7ffd58322440) at 
> ext2x_client.c:1432
> #3  0x7f66a13bc286 in ompi_dpm_disconnect (comm=0x2cc0810) at 
> dpm/dpm.c:596
> #4  0x7f66a13e8668 in PMPI_Comm_disconnect (comm=0x2cbe058) at 
> pcomm_disconnect.c:67
> #5  0x7f66a16799e9 in mpi_comm_disconnect () from 
> /cluster/software/R-packages/3.5/Rmpi/libs/Rmpi.so
> #6  0x7f66b2563de5 in do_dotcall () from 
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #7  0x7f66b25a207b in bcEval () from 
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #8  0x7f66b25b0fd0 in Rf_eval.localalias.34 () from 
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> #9  0x7f66b25b2c62 in R_execClosure () from 
> /cluster/software/R/3.5.0/lib64/R/lib/libR.so
> 
> Might this also be related to the dynamic rank creation in R?
> 
> Thanks!
> 
> Marcin
> 