Dear list,

The bad behaviour now occurs only with the 1.2.x versions of Open MPI (I have tried 1.2.5, 1.2.8 and 1.2.9 with gcc, and 1.2.7 and 1.2.9 with pgi cc; the problem appears in all of them). With 1.3.1 I can find no problem at all, so perhaps the problem has already been fixed there?

mpirun -np 4 ./test4 | head
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60
Result on rank 1 strangely is 50
Result on rank 1 strangely is 30
Result on rank 3 strangely is 90
Result on rank 3 strangely is 80
Result on rank 0 strangely is 50
Result on rank 1 strangely is 40

Without IB there is no problem:
mpirun -mca btl self,tcp -np 4 ./test4
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60

The full, bug-fixed code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>


int main(int argc, char **argv)
{
  int rank,size,i,j,k;
  const int arrlen=10;
  const int repeattest=1000000;
  double *array;
  MPI_Request *reqarr;
  MPI_Status *mpistat;
  MPI_Datatype STRIDED;
  int torank,fromrank,nreq;
  int sumshouldbe;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);

  /* Non-contiguous data */
  MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
  MPI_Type_commit(&STRIDED);

  array=malloc(arrlen*size *sizeof *array);
  reqarr=malloc(2*size*sizeof *reqarr);
  mpistat=malloc(2*size*sizeof *mpistat);

  /* Setup communication */
  sumshouldbe=0;
  nreq=0;
  for (i=1; i<size; i++)
    {
      torank=rank+i;
      if (torank>=size)
        torank-=size;
      fromrank=rank-i;
      if (fromrank<0)
        fromrank+=size;
      MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
      nreq++;
      MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
      nreq++;
      sumshouldbe+=i;
    }
  printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
  /* Do the tests. */
  for (j=0; j<repeattest; j++)
    {
      double sum=0.;
      /* Init test arrays. Array on first process is initially all
         zero. On second process all one, etc. Same as rank number. */
      for (i=0; i<arrlen*size; i++)
        array[i]=(double)rank;

      /* Start communication */
      MPI_Startall(nreq,reqarr);

      /* Accumulate part of arrays that are not communicated. This
         touches the parts of the arrays that are *not*
         communicated!! */
      for (i=0; i<arrlen; i++)
        sum+=array[i*size];

      /* Wait for communication to finish */
      MPI_Waitall(nreq,reqarr,mpistat);

      /* Accumulate part of arrays that have been communicated. */
      for (i=0; i<arrlen; i++)
        {
          for (k=0; k<size-1; k++)
            sum+=array[i*size+1+k];
        }

      if (sum!=arrlen*sumshouldbe)
        printf("Result on rank %d strangely is %g\n",rank,sum);
    }

  MPI_Finalize();
  return 0;
}

Details about the computer & OS are in the original mail (quoted below).


Daniel Spångberg


On 2009-03-25 14:52:16, Daniel Spångberg <dani...@mkem.uu.se> wrote:

Dear list,

A colleague pointed out an error in my test code. The final loop should not be
  for (i=0; i<arrlen*(size-1); i++)
but rather
  for (i=0; i<arrlen; i++)

details, details... Anyway, I still get problems from time to time with this test code, but I have not yet had time to figure out the circumstances under which this happens. I will report back to this list once I know what's going on.

Sorry to trouble you too early!

Daniel Spångberg


On 2009-03-25 09:44:37, Daniel Spångberg <dani...@mkem.uu.se> wrote:

Dear list,

We've found a problem with Open MPI when running over IB: a computation that reads some elements of an array while communication is in flight to other elements of the same array (elements that are not used in the computation) produces wrong results. I have written a small test program (below) that shows this behaviour. When the array is small (arrlen in the code), problems occur more often. The problems occur only when using IB (even on the same node!?); with mpirun -mca btl tcp,self the problem vanishes.

The behaviour with 1.2.9 and 1.3.1 is slightly different: problems occur already with 3 processes under 1.2.9, whereas 4 processes are required to trigger them with 1.3.1. The correct output on 4 processes should be just:
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60

With IB:
mpirun -np 4 ./test3 | head
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60
Result on rank 0 strangely is 1.06316e+248
Result on rank 2 strangely is 1.54396e+262
Result on rank 3 strangely is 3.87325e+233
Result on rank 1 strangely is 1.54396e+262
Result on rank 1 strangely is 1.54396e+262
Result on rank 2 strangely is 1.54396e+262


Info about the system:

openmpi: 1.2.9, 1.3.1

From ompi_info:
    MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.1)

From lspci:
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

configure picks up ibverbs:
--- MCA component btl:ofud (m4 configuration macro)
checking for MCA component btl:ofud compile mode... dso
checking --with-openib value... simple ok (unspecified)
checking --with-openib-libdir value... simple ok (unspecified)
checking for fcntl.h... (cached) yes
checking sys/poll.h usability... yes
checking sys/poll.h presence... yes
checking for sys/poll.h... yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_open_device in -libverbs... yes
checking number of arguments to ibv_create_cq... 5
checking whether IBV_EVENT_CLIENT_REREGISTER is declared... yes
checking for ibv_get_device_list... yes
checking for ibv_resize_cq... yes
checking for struct ibv_device.transport_type... yes
checking for ibv_create_xrc_rcv_qp... no
checking rdma/rdma_cma.h usability... yes
checking rdma/rdma_cma.h presence... yes
checking for rdma/rdma_cma.h... yes
checking for rdma_create_id in -lrdmacm... yes
checking for rdma_get_peer_addr... yes
checking for infiniband/driver.h... yes
checking if ConnectX XRC support is enabled... no
checking if OpenFabrics RDMACM support is enabled... yes
checking if OpenFabrics IBCM support is enabled... no
checking if MCA component btl:ofud can compile... yes

--- MCA component btl:openib (m4 configuration macro)
checking for MCA component btl:openib compile mode... dso
checking --with-openib value... simple ok (unspecified)
checking --with-openib-libdir value... simple ok (unspecified)
checking for fcntl.h... (cached) yes
checking for sys/poll.h... (cached) yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_open_device in -libverbs... yes
checking number of arguments to ibv_create_cq... (cached) 5
checking whether IBV_EVENT_CLIENT_REREGISTER is declared... (cached) yes
checking for ibv_get_device_list... (cached) yes
checking for ibv_resize_cq... (cached) yes
checking for struct ibv_device.transport_type... (cached) yes
checking for ibv_create_xrc_rcv_qp... (cached) no
checking for rdma/rdma_cma.h... (cached) yes
checking for rdma_create_id in -lrdmacm... (cached) yes
checking for rdma_get_peer_addr... yes
checking for infiniband/driver.h... (cached) yes
checking if ConnectX XRC support is enabled... no
checking if OpenFabrics RDMACM support is enabled... yes
checking if OpenFabrics IBCM support is enabled... no
checking for ibv_fork_init... yes
checking for thread support (needed for ibcm/rdmacm)... posix
checking which openib btl cpcs will be built... oob rdmacm
checking if MCA component btl:openib can compile... yes


Compilers: gcc 4.1.2 and pgcc 8.0-4 (64-bit) show the same problems; the optimization level does not matter (-fast, -O3 or -O0).

CPU: opteron 250
OS: Scientific linux 5.2

If you require any more information, I'll be more than happy to provide it!

Is this a proper way to overlap communication with calculation? Could this be some kind of cache-coherency problem, with the values already in the CPU cache while RDMA deposits the data directly in memory? In that case, though, I would expect the sum not to be that far off. What would happen if the compiler decided to do non-temporal prefetches (or stores, in the general case)?



The code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>


int main(int argc, char **argv)
{
   int rank,size,i,j,k;
   const int arrlen=10;
   const int repeattest=100;
   double *array;
   MPI_Request *reqarr;
   MPI_Status *mpistat;
   MPI_Datatype STRIDED;
   int torank,fromrank,nreq;
   int sumshouldbe;
   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&size);

   /* Non-contiguous data */
   MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
   MPI_Type_commit(&STRIDED);

   array=malloc(arrlen*size *sizeof *array);
   reqarr=malloc(2*size*sizeof *reqarr);
   mpistat=malloc(2*size*sizeof *mpistat);

   /* Setup communication */
   sumshouldbe=0;
   nreq=0;
   for (i=1; i<size; i++)
     {
       torank=rank+i;
       if (torank>=size)
         torank-=size;
       fromrank=rank-i;
       if (fromrank<0)
         fromrank+=size;
       MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
       nreq++;
       MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
       nreq++;
       sumshouldbe+=i;
     }
   printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
   /* Do the tests. */
   for (j=0; j<repeattest; j++)
     {
       double sum=0.;
       /* Init test arrays. Array on first process is initially all
          zero. On second process all one, etc. Same as rank number. */
       for (i=0; i<arrlen*size; i++)
         array[i]=(double)rank;

       /* Start communication */
       MPI_Startall(nreq,reqarr);

       /* Accumulate part of arrays that are not communicated. This
          touches the parts of the arrays that are *not*
          communicated!! */
       for (i=0; i<arrlen; i++)
         sum+=array[i*size];

       /* Wait for communication to finish */
       MPI_Waitall(nreq,reqarr,mpistat);

       /* Accumulate part of arrays that have been communicated. */
       for (i=0; i<arrlen*(size-1); i++)
         {
           for (k=0; k<size-1; k++)
             sum+=array[i*size+1+k];
         }

       if (sum!=arrlen*sumshouldbe)
         printf("Result on rank %d strangely is %g\n",rank,sum);
     }

   MPI_Finalize();
   return 0;
}

--
Daniel Spångberg
Materialkemi
Uppsala Universitet
