Hi,


It looks like the rdma OSC component does not progress passive RMA
operations at the target during calls to MPI_WIN_(UN)LOCK. As a sample case,
take a master-worker program where each worker writes to an entry in an
array exposed in the master's window:



MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)
{
   // Master code
   MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &array);
   MPI_Win_create(array, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
      MPI_COMM_WORLD, &win);
   do
   {
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
      nonzeros = /* count the non-zero elements of array */;
      MPI_Win_unlock(0, win);
   } while (nonzeros < size-1);
   MPI_Win_free(&win);
   MPI_Free_mem(array);
}
else
{
   // Worker code
   int one = 1;
   MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
   // Postpone the RMA with a rank-specific delay
   sleep(rank);
   MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
   MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
   MPI_Win_unlock(0, win);
   MPI_Win_free(&win);
}



Attached is a complete sample program. The program hangs when run with the
default MCA settings:



$ mpirun -n 3 ./rma.x
[1379003818.571960] 0 workers checked in
[1379003819.571317] Worker 1 acquired lock
[1379003819.571374] Worker 1 unlocking the window
[1379003820.571342] Worker 2 acquired lock
[1379003820.571384] Worker 2 unlocking the window
<hangs>

On the other hand, it works as expected if pt2pt is forced:



$ mpirun --mca osc pt2pt -n 3 ./rma.x | sort
[1379003926.000442] 0 workers checked in
[1379003926.998981] Worker 1 acquired lock
[1379003926.999027] Worker 1 unlocking the window
[1379003926.999076] Worker 1 synched
[1379003926.999078] 1 workers checked in
[1379003927.998917] Worker 2 acquired lock
[1379003927.998940] Worker 2 unlocking the window
[1379003927.998962] Worker 2 synched
[1379003927.998964] 2 workers checked in
[1379003927.998973] All workers checked in
[1379003927.998996] Worker 1 done
[1379003927.998996] Worker 2 done
[1379003927.999099] Master finished
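
(For completeness: instead of passing --mca on the command line, the same
selection can presumably also be made through the corresponding environment
variable, e.g.

$ export OMPI_MCA_osc=pt2pt
$ mpirun -n 3 ./rma.x | sort

which should be equivalent to the run above.)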



All processes are started on the same host. Open MPI is 1.6.4, built without
a progress thread; the output from ompi_info is attached. The same behaviour
(hang with rdma, success with pt2pt) is observed when the tcp BTL is forced,
as well as when all processes run on separate cluster nodes and communicate
via the openib BTL.
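
For anyone trying to reproduce the tcp case: restricting the BTLs with
something like

$ mpirun --mca btl tcp,self -n 3 ./rma.x

should be enough; the osc framework is left at its default (rdma), so only
the transport changes.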



Is this a bug in the rdma OSC component, or does the sample program violate
the MPI correctness requirements for RMA operations?



Kind regards,

Hristo



--

Hristo Iliev, PhD - High Performance Computing Team

RWTH Aachen University, Center for Computing and Communication

Rechen- und Kommunikationszentrum der RWTH Aachen

Seffenter Weg 23, D 52074 Aachen (Germany)

                 Package: Open MPI pk224...@linuxbmc0601.rz.rwth-aachen.de Distribution
                Open MPI: 1.6.4
   Open MPI SVN revision: r28081
   Open MPI release date: Feb 19, 2013
                Open RTE: 1.6.4
   Open RTE SVN revision: r28081
   Open RTE release date: Feb 19, 2013
                    OPAL: 1.6.4
       OPAL SVN revision: r28081
       OPAL release date: Feb 19, 2013
                 MPI API: 2.1
            Ident string: 1.6.4
                  Prefix: /opt/MPI/openmpi-1.6.4/linux/intel
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: linuxbmc0601.rz.RWTH-Aachen.DE
           Configured by: pk224850
           Configured on: Wed May 22 17:01:57 CEST 2013
          Configure host: linuxbmc0601.rz.RWTH-Aachen.DE
                Built by: pk224850
                Built on: Wed May 22 17:18:51 CEST 2013
              Built host: linuxbmc0601.rz.RWTH-Aachen.DE
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icc
  C compiler family name: INTEL
      C compiler version: 1110.20101201
            C++ compiler: icpc
   C++ compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icpc
      Fortran77 compiler: ifort -nofor-main -f77rtl -fpconstant -intconstant
  Fortran77 compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
      Fortran90 compiler: ifort -nofor-main
  Fortran90 compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: no
   Heterogeneous support: yes
 mpirun default --prefix: yes
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
     Symbol vis. support: yes
   Host topology support: yes
          MPI extensions: affinity example
   FT Checkpoint support: no (checkpoint thread: no)
     VampirTrace support: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6.4)
              MCA memory: linux (MCA v2.0, API v2.0, Component v1.6.4)
           MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6.4)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.6.4)
               MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6.4)
               MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6.4)
               MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6.4)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
           MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.6.4)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6.4)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6.4)
             MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6.4)
               MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6.4)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6.4)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6.4)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.6.4)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6.4)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.6.4)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.6.4)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.6.4)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.6.4)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.4)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.6.4)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ras: cm (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ras: loadleveler (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ras: lsf (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.6.4)
               MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: cm (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: radix (MCA v2.0, API v2.0, Component v1.6.4)
              MCA routed: slave (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA plm: lsf (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.6.4)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.6.4)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: lsf (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: slave (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.6.4)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.6.4)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.6.4)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.6.4)
             MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.6.4)
            MCA notifier: command (MCA v2.0, API v1.0, Component v1.6.4)
            MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.6.4)

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
       // Master: expose an array with one int slot per rank and poll it
       // under an exclusive lock until all workers have checked in
       int *array;
       MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &array);
       memset(array, 0, size * sizeof(int));

       MPI_Win_create(array, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
          MPI_COMM_WORLD, &win);

       int ready, ready1 = -1;
       do
       {
          MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
          for (int i = ready = 0; i < size; ready += array[i++]);
          if (ready != ready1)
          {
             printf("[%.6f] %d workers checked in\n", MPI_Wtime(), ready);
             ready1 = ready;
          }
          MPI_Win_unlock(0, win);
       } while (ready < size-1);

       printf("[%.6f] All workers checked in\n", MPI_Wtime());

       MPI_Win_free(&win);

       MPI_Free_mem(array);

       printf("[%.6f] Master finished\n", MPI_Wtime());
    }
    else
    {
       // Worker: create a zero-sized window, then put 1 into this rank's
       // slot of the master's array under an exclusive lock
       int one = 1;

       MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       sleep(rank);   // postpone the RMA with a rank-specific delay

       MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
       printf("[%.6f] Worker %d acquired lock\n", MPI_Wtime(), rank);

       MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
       printf("[%.6f] Worker %d unlocking the window\n", MPI_Wtime(), rank);

       MPI_Win_unlock(0, win);
       printf("[%.6f] Worker %d synched\n", MPI_Wtime(), rank);

       MPI_Win_free(&win);
       printf("[%.6f] Worker %d done\n", MPI_Wtime(), rank);
    }

    MPI_Finalize();
    return 0;
}
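
To reproduce, the program should build with the usual compiler wrapper.
Assuming the source file is named rma.c (the name is only for illustration),
something like the following should do; -std=c99 is needed because of the
declaration inside the for loop:

$ mpicc -std=c99 rma.c -o rma.x
$ mpirun -n 3 ./rma.x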
