Re: [OMPI users] Put/Get semantics

2016-01-08 Thread Jeff Hammond
Instead of MPI_Alloc_mem and MPI_Win_create, you should use
MPI_Win_allocate.  This makes it much easier for the implementation to
optimize with interprocess shared memory and to exploit scalability features
such as symmetric, globally addressable memory.  It also obviates the need
to call both MPI_Win_free and MPI_Free_mem, since MPI_Win_free releases the
window and its memory together.
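
A minimal sketch of this pattern (assuming a window of 1024 doubles over
MPI_COMM_WORLD; the names are illustrative, not taken from the original
program):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const MPI_Aint count = 1024;   /* illustrative window size */
    double *base = NULL;
    MPI_Win win;

    /* One call allocates the memory and exposes it as an RMA window; the
       implementation is free to back it with shared or symmetric memory. */
    MPI_Win_allocate(count * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* ... passive-target RMA on win ... */

    /* A single call releases both the window and the memory behind it. */
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}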

Based on what I've seen recently (
https://travis-ci.org/jeffhammond/armci-mpi), using MPI_Win_allocate may
fix some unresolved Open MPI RMA bugs (
https://github.com/open-mpi/ompi/issues/1275).

As for your synchronization question, instead of

MPI_Rget(b,1,dtype,rproc,displ,1,dtype,win,&request);
MPI_Wait(&request,&status);

and

MPI_Rput(a,1,dtype,rproc,displ,1,dtype,win,&request);
MPI_Wait(&request,&status);

you should use

MPI_Get(b,1,dtype,rproc,displ,1,dtype,win);
MPI_Win_flush_local(rproc,win);

and

MPI_Put(a,1,dtype,rproc,displ,1,dtype,win);
MPI_Win_flush_local(rproc,win);

as there is no need to create a request for this usage model.
Request-based RMA entails implementation overhead in some cases, and it is
more likely to be broken since it is not heavily tested. On the other hand,
the non-request RMA calls have been tested extensively thanks to the
thousands of NWChem jobs I've run using ARMCI-MPI on Cray, InfiniBand, and
other systems.
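
For completeness, here is a minimal sketch of the put-then-get sequence from
the original question in this style (assuming the window is already inside a
passive-target epoch opened with MPI_Win_lock_all, and that a, b, dtype,
rproc and displ are as in the snippets above):

MPI_Put(a, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_flush(rproc, win);        /* complete the put at the target       */

MPI_Get(b, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_flush_local(rproc, win);  /* b is now safe to read locally        */

/* ... at the end of the program ... */
MPI_Win_unlock_all(win);          /* close the passive-target epoch       */

Note the remote flush (MPI_Win_flush) between the put and the get: local
completion alone does not guarantee that the get will observe the put's
data at the target.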

As I think I've said before on some list, one of the best ways to
understand the mapping between ARMCI and MPI RMA is to look at ARMCI-MPI.

Jeff

On Wed, Jan 6, 2016 at 8:51 AM, Palmer, Bruce J  wrote:
>
> Hi,
> I’m trying to compare the semantics of MPI RMA with those of ARMCI. I’ve
written a small test program that writes data to a remote processor and
then reads the data back to the original processor. In ARMCI, you should be
able to do this since operations to the same remote processor are completed
in the same order that they are requested on the calling processor. I’ve
implemented this two different ways using MPI RMA. The first is to call
MPI_Win_lock to create a shared lock on the remote processor, then
MPI_Put/MPI_Get to initiate the data transfer and finally MPI_Win_unlock to
force completion of the data transfer. My understanding is that this should
allow you to write data to and then read it back from the same process,
since the first triplet
> MPI_Win_lock
> MPI_Put
> MPI_Win_unlock
> must be completed both locally and remotely before the unlock call
completes. The calls in the second triplet
> MPI_Win_lock
> MPI_Get
> MPI_Win_unlock
> cannot start until the first triplet is done, so if both the put and the
get refer to the same data on the same remote processor, then it should
work.
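
A minimal sketch of this lock-based variant (with a, b, dtype, rproc, displ
and win as described above; MPI_LOCK_SHARED is the shared lock mentioned):

MPI_Win_lock(MPI_LOCK_SHARED, rproc, 0, win);
MPI_Put(a, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_unlock(rproc, win);   /* put complete at origin and target here   */

MPI_Win_lock(MPI_LOCK_SHARED, rproc, 0, win);
MPI_Get(b, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_unlock(rproc, win);   /* b now holds the value read back from rproc */
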
> The second implementation uses request-based RMA and starts by calling
MPI_Win_lock_all collectively on the window when it is created and
MPI_Win_unlock_all when it is destroyed, so that the window is always in a
passive synchronization epoch. The put is implemented by calling MPI_Rput
followed by MPI_Wait on the handle returned from the MPI_Rput call.
Similarly, the get is implemented by calling MPI_Rget followed by MPI_Wait.
The wait call guarantees that the operation is completed locally and the
data can then be used. However, from what I understand of the standard, it
doesn’t say anything about the ordering of the operations, so the get could
conceivably complete at the target before the put. Inserting an
MPI_Win_flush_all between the MPI_Rput and MPI_Rget should guarantee that
the operations are ordered.
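
A minimal sketch of this request-based variant (assuming
MPI_Win_lock_all(0, win) was called when the window was created, with the
same names as above):

MPI_Request req;
MPI_Status  status;

MPI_Rput(a, 1, dtype, rproc, displ, 1, dtype, win, &req);
MPI_Wait(&req, &status);     /* local completion of the put only          */
MPI_Win_flush_all(win);      /* the "with synchronization" option: remote
                                completion that orders the put before the
                                following get                              */

MPI_Rget(b, 1, dtype, rproc, displ, 1, dtype, win, &req);
MPI_Wait(&req, &status);     /* b is now valid locally                    */
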
> I’ve written the test program so that it can use either the lock or
request-based implementations and I’ve also included an option that inserts
a fence/flush plus barrier operation between put and get. The different
configurations can be set up by defining some preprocessor symbols at the
top of the program.  The program loops over the test repeatedly and the
current number of loops is set at 2000. The results I get running on a
Linux cluster with an InfiniBand network, using 2 processes placed on 2
different SMP nodes, are as follows:
> Using OpenMPI-1.8.3:
> Request-based implementation without synchronization: 9 successes out of 10 runs
> Request-based implementation with synchronization: 19 successes out of 20 runs
> Lock-based implementation without synchronization: 1 success out of 10 runs
> Lock-based implementation with synchronization: 1 success out of 10 runs
> Using OpenMPI-1.10.1:
> Request-based implementation without synchronization: 2 successes out of 10 runs
> Request-based implementation with synchronization: 8 successes out of 10 runs
> Lock-based implementation without synchronization: 4 successes out of 10 runs
> Lock-based implementation with synchronization: 2 successes out of 10 runs
> Except for the request-based implementation without synchronization (where
the synchronization is a call to MPI_Win_flush_all), I would expect all of
these to succeed. Is there some fault in my thinking here? I’ve attached the
test program.
> Bruce Palmer
>
>

Re: [OMPI users] Singleton process spawns additional thread

2016-01-08 Thread Ralph Castain
A singleton will indeed have an extra thread, but it should be quiescent. I’ll 
check the 1.10.2 release candidate and see if it still exhibits that behavior.


> On Jan 7, 2016, at 9:32 PM, Au Eelis  wrote:
> 
> Hi!
> 
> It is related insofar as one of these threads is actually doing
> something.
> 
> Btw, I noticed this on two separate machines! A computing cluster with 
> admin-built openmpi and Archlinux with openmpi from the repositories.
> 
> However, running the code with openmpi 1.6.2 and ifort 13.0.0 does not show 
> this behaviour.
> 
> Best regards,
> Stefan
> 
> 
> 
> On 01/07/2016 03:27 PM, Sasso, John (GE Power, Non-GE) wrote:
>> Stefan,  I don't know if this is related to your issue, but FYI...
>> 
>> 
>>> Those are async progress threads - they block unless something requires 
>>> doing
>>> 
>>> 
 On Apr 15, 2015, at 8:36 AM, Sasso, John (GE Power & Water, Non-GE) 
  wrote:
 
 I stumbled upon something while using 'ps -eFL' to view threads of 
 processes, and Google searches have failed to answer my question.  This 
 question holds for OpenMPI 1.6.x and even OpenMPI 1.4.x.
 For a program which is pure MPI (built and run using OpenMPI) and does not 
 implement Pthreads or OpenMP, why is it that each MPI task appears as 
 having 3 threads:
UID      PID   PPID   LWP  C NLWP     SZ    RSS PSR STIME TTY      TIME CMD
sasso  20512  20493 20512 99    3 187849 582420  14 11:01 ?    00:26:37 /home/sasso/mpi_example.exe
sasso  20512  20493 20588  0    3 187849 582420  11 11:01 ?    00:00:00 /home/sasso/mpi_example.exe
sasso  20512  20493 20599  0    3 187849 582420  12 11:01 ?    00:00:00 /home/sasso/mpi_example.exe
 whereas if I compile and run a non-MPI program, 'ps -eFL' shows it running 
 as a single thread?
 
Granted, the CPU utilization (C) for 2 of the 3 threads is zero, but the
threads are bound to different processors (11, 12, 14). I am curious as to
why this is; I am not complaining that there is a problem.  Thanks!
 --john
>> 
>> 
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Au Eelis
>> Sent: Thursday, January 07, 2016 7:10 AM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] Singleton process spawns additional thread
>> 
>> Hi!
>> 
>> I have a weird problem with executing a singleton OpenMPI program, where an 
>> additional thread causes full load, while the master thread performs the 
>> actual calculations.
>> 
>> In contrast, executing "mpirun -np 1 [executable]" performs the same 
>> calculation at the same speed but the additional thread is idling.
>> 
>> In my understanding, both calculations should behave in the same way (i.e., 
>> one working thread) for a program which is simply moving some data around 
>> (mainly some MPI_BCAST and MPI_GATHER commands).
>> 
>> I observed this behaviour in OpenMPI 1.10.1 with both ifort 16.0.1 and 
>> gfortran 5.3.0. I have created a minimal working example, which is appended 
>> to this mail.
>> 
>> Am I missing something?
>> 
>> Best regards,
>> Stefan
>> 
>> -
>> 
>> MWE: Compile this with "mpifort main.f90". When executing with "./a.out", 
>> there is a thread wasting cycles, while the master thread waits for input. 
>> When executing with "mpirun -np 1 ./a.out", this thread is idling.
>> 
>> program main
>>  use mpi_f08
>>  implicit none
>> 
>>  integer :: ierror,rank
>> 
>>  call MPI_Init(ierror)
>>  call MPI_Comm_Rank(MPI_Comm_World,rank,ierror)
>> 
>>  ! let master thread wait on [RETURN]-key
>>  if (rank == 0) then
>>  read(*,*)
>>  end if
>> 
>>  write(*,*) rank
>> 
>>  call mpi_barrier(mpi_comm_world, ierror)
>> end program
