Hi,
It looks like the rdma OSC component does not progress passive-target RMA
operations at the target during calls to MPI_WIN_(UN)LOCK. As a sample case,
take a master-worker program where each worker writes to an entry in an
array exposed in the master's window:
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
    {
        // Master code
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &array);
        MPI_Win_create(array, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
        do
        {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            // nonzeros = number of non-zero elements of array
            MPI_Win_unlock(0, win);
        } while (nonzeros < size-1);
        MPI_Win_free(&win);
        MPI_Free_mem(array);
    }
    else
    {
        // Worker code
        int one = 1;
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        // Postpone the RMA with a rank-specific time
        sleep(rank);
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);
        MPI_Win_free(&win);
    }
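For reference, the counting step that is only abbreviated in the master loop
above could be written as a small helper. This is just a sketch:
count_nonzeros is a name made up for this mail and does not appear in the
attached program, which inlines an equivalent sum over the array instead.

```c
/* Hypothetical helper, not part of the attached program: returns how many
 * of the first n entries of array are non-zero. */
static int count_nonzeros(const int *array, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        if (array[i] != 0)
            count++;
    }
    return count;
}
```

In the master loop it would be called between MPI_Win_lock and
MPI_Win_unlock, e.g. nonzeros = count_nonzeros(array, size);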
Attached is a complete sample program. The program hangs when run with the
default MCA settings:
$ mpirun -n 3 ./rma.x
[1379003818.571960] 0 workers checked in
[1379003819.571317] Worker 1 acquired lock
[1379003819.571374] Worker 1 unlocking the window
[1379003820.571342] Worker 2 acquired lock
[1379003820.571384] Worker 2 unlocking the window
<hangs>
On the other hand, it works as expected if pt2pt is forced:
$ mpirun --mca osc pt2pt -n 3 ./rma.x | sort
[1379003926.000442] 0 workers checked in
[1379003926.998981] Worker 1 acquired lock
[1379003926.999027] Worker 1 unlocking the window
[1379003926.999076] Worker 1 synched
[1379003926.999078] 1 workers checked in
[1379003927.998917] Worker 2 acquired lock
[1379003927.998940] Worker 2 unlocking the window
[1379003927.998962] Worker 2 synched
[1379003927.998964] 2 workers checked in
[1379003927.998973] All workers checked in
[1379003927.998996] Worker 1 done
[1379003927.998996] Worker 2 done
[1379003927.999099] Master finished
All processes are started on the same host. Open MPI is version 1.6.4, built
without a progress thread. The output from ompi_info is attached. The same
behaviour (hang with rdma, success with pt2pt) is observed when the tcp BTL
is used and when all processes run on separate cluster nodes and talk via
the openib BTL.
Is this a bug in the rdma OSC component or does the sample program violate
the MPI correctness requirements for RMA operations?
Kind regards,
Hristo
--
Hristo Iliev, PhD - High Performance Computing Team
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Package: Open MPI [email protected] Distribution
Open MPI: 1.6.4
Open MPI SVN revision: r28081
Open MPI release date: Feb 19, 2013
Open RTE: 1.6.4
Open RTE SVN revision: r28081
Open RTE release date: Feb 19, 2013
OPAL: 1.6.4
OPAL SVN revision: r28081
OPAL release date: Feb 19, 2013
MPI API: 2.1
Ident string: 1.6.4
Prefix: /opt/MPI/openmpi-1.6.4/linux/intel
Configured architecture: x86_64-unknown-linux-gnu
Configure host: linuxbmc0601.rz.RWTH-Aachen.DE
Configured by: pk224850
Configured on: Wed May 22 17:01:57 CEST 2013
Configure host: linuxbmc0601.rz.RWTH-Aachen.DE
Built by: pk224850
Built on: Wed May 22 17:18:51 CEST 2013
Built host: linuxbmc0601.rz.RWTH-Aachen.DE
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: icc
C compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icc
C compiler family name: INTEL
C compiler version: 1110.20101201
C++ compiler: icpc
C++ compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icpc
Fortran77 compiler: ifort -nofor-main -f77rtl -fpconstant -intconstant
Fortran77 compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
Fortran90 compiler: ifort -nofor-main
Fortran90 compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: yes
Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: no
Heterogeneous support: yes
mpirun default --prefix: yes
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions: affinity example
FT Checkpoint support: no (checkpoint thread: no)
VampirTrace support: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6.4)
MCA memory: linux (MCA v2.0, API v2.0, Component v1.6.4)
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6.4)
MCA carto: file (MCA v2.0, API v2.0, Component v1.6.4)
MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6.4)
MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6.4)
MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6.4)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.6.4)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6.4)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6.4)
MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6.4)
MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6.4)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6.4)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6.4)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6.4)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: self (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.6.4)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6.4)
MCA io: romio (MCA v2.0, API v2.0, Component v1.6.4)
MCA mpool: fake (MCA v2.0, API v2.0, Component v1.6.4)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.6.4)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.6.4)
MCA pml: bfo (MCA v2.0, API v2.0, Component v1.6.4)
MCA pml: csum (MCA v2.0, API v2.0, Component v1.6.4)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.6.4)
MCA pml: v (MCA v2.0, API v2.0, Component v1.6.4)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.6.4)
MCA rcache: vma (MCA v2.0, API v2.0, Component v1.6.4)
MCA btl: self (MCA v2.0, API v2.0, Component v1.6.4)
MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.4)
MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.4)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.4)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.4)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.6.4)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.6.4)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.6.4)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.6.4)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.6.4)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.6.4)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.6.4)
MCA odls: default (MCA v2.0, API v2.0, Component v1.6.4)
MCA ras: cm (MCA v2.0, API v2.0, Component v1.6.4)
MCA ras: loadleveler (MCA v2.0, API v2.0, Component v1.6.4)
MCA ras: lsf (MCA v2.0, API v2.0, Component v1.6.4)
MCA ras: slurm (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.6.4)
MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.6.4)
MCA rml: oob (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: binomial (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: cm (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: direct (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: linear (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: radix (MCA v2.0, API v2.0, Component v1.6.4)
MCA routed: slave (MCA v2.0, API v2.0, Component v1.6.4)
MCA plm: lsf (MCA v2.0, API v2.0, Component v1.6.4)
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.6.4)
MCA plm: slurm (MCA v2.0, API v2.0, Component v1.6.4)
MCA filem: rsh (MCA v2.0, API v2.0, Component v1.6.4)
MCA errmgr: default (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: env (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: hnp (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: lsf (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: singleton (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: slave (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: slurm (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.6.4)
MCA ess: tool (MCA v2.0, API v2.0, Component v1.6.4)
MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.6.4)
MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.6.4)
MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.6.4)
MCA notifier: command (MCA v2.0, API v1.0, Component v1.6.4)
MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.6.4)
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int *array;

        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &array);
        memset(array, 0, size * sizeof(int));
        MPI_Win_create(array, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        int ready, ready1 = -1;
        do
        {
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            for (int i = ready = 0; i < size; ready += array[i++]);
            if (ready != ready1)
            {
                printf("[%.6f] %d workers checked in\n", MPI_Wtime(), ready);
                ready1 = ready;
            }
            MPI_Win_unlock(0, win);
        } while (ready < size-1);
        printf("[%.6f] All workers checked in\n", MPI_Wtime());

        MPI_Win_free(&win);
        MPI_Free_mem(array);
        printf("[%.6f] Master finished\n", MPI_Wtime());
    }
    else
    {
        int one = 1;

        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        sleep(rank);

        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        printf("[%.6f] Worker %d acquired lock\n", MPI_Wtime(), rank);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        printf("[%.6f] Worker %d unlocking the window\n", MPI_Wtime(), rank);
        MPI_Win_unlock(0, win);
        printf("[%.6f] Worker %d synched\n", MPI_Wtime(), rank);

        MPI_Win_free(&win);
        printf("[%.6f] Worker %d done\n", MPI_Wtime(), rank);
    }

    MPI_Finalize();
    return 0;
}
