vasilis wrote:
Rank 0 accumulates all the res_cpu values into a single array, res.  It
starts with its own res_cpu and then adds the contributions from all the
other processes.  When np=2, that means the order is prescribed.  When
np>2, the order is no longer prescribed and some floating-point rounding
variations can start to occur.
    
Yes, you are right. Now, the question is why these floating-point rounding
variations would occur for np>2. It cannot be due to an unprescribed order!
  
The accumulation of res_cpu into res starts with rank 0 and then handles everyone else in arbitrary order (due to MPI_ANY_SOURCE).  With np=2, this means the order is fully deterministic (0 then 1).  With np>2, the order is no longer deterministic.  E.g., for np=3, you could have 0 then 1 then 2, or you could have 0 then 2 then 1.  Since floating-point addition is not associative, summing the same values in a different order can produce slightly different rounded results.
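
To make that last point concrete, here is a tiny standalone Fortran sketch (nothing to do with your application code, with values chosen purely for illustration) showing that double-precision addition is not associative:

program summation_order
    implicit none
    double precision :: a, b, c
    ! b is smaller than half an ulp of a, so it is absorbed in (a + b)
    a =  1.0d0
    b =  1.0d-16
    c = -1.0d0
    ! With IEEE double precision, the first line typically prints 0.0
    ! and the second roughly 1.1E-16: same values, different order.
    print *, '(a + b) + c = ', (a + b) + c
    print *, 'a + (c + b) = ', a + (c + b)
end program summation_order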

Here is another version of the code, without MPI_ANY_SOURCE or MPI_ANY_TAG:

if ( mumps_par%MYID .eq. 0 ) then
    ! Rank 0 handles its own local buffers first (jw = 0), then the
    ! remaining ranks in the fixed order 1, 2, ..., nsize-1.
    do jw = 0, nsize-1
        if ( jw /= 0 ) then
            ! Receive rank jw's contribution; the source and tags are fixed,
            ! so there is no MPI_ANY_SOURCE / MPI_ANY_TAG nondeterminism.
            call MPI_Recv(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,jw,5,MPI_COMM_WORLD,status1,ierr)
            call MPI_Recv(  res_cpu,total_unknowns       ,MPI_DOUBLE_PRECISION,jw,6,MPI_COMM_WORLD,status2,ierr)
            call MPI_Recv(  row_cpu,total_elem_cpu*unique,MPI_INTEGER         ,jw,7,MPI_COMM_WORLD,status3,ierr)
            call MPI_Recv(  col_cpu,total_elem_cpu*unique,MPI_INTEGER         ,jw,8,MPI_COMM_WORLD,status4,ierr)
        end if
        ! Accumulate rank jw's contribution into the global arrays.
        res         (:   ) = res  (:   )        +   res_cpu(:)
        jacob       (:,jw) = jacob(:,jw)        + jacob_cpu(:)
        position_col(:,jw) = position_col(:,jw) +   col_cpu(:)
        position_row(:,jw) = position_row(:,jw) +   row_cpu(:)
    end do
else
    ! Every other rank sends its local buffers to rank 0.
    call MPI_Send(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,0,5,MPI_COMM_WORLD,ierr)
    call MPI_Send(  res_cpu,total_unknowns       ,MPI_DOUBLE_PRECISION,0,6,MPI_COMM_WORLD,ierr)
    call MPI_Send(  row_cpu,total_elem_cpu*unique,MPI_INTEGER         ,0,7,MPI_COMM_WORLD,ierr)
    call MPI_Send(  col_cpu,total_elem_cpu*unique,MPI_INTEGER         ,0,8,MPI_COMM_WORLD,ierr)
end if

P.S.  It seems to me that you could use MPI collective operations to
implement what you're doing.  E.g., something like:
    
I could use these operations for the res variable (Will it make the summation 
any faster?).
Potentially faster.  It allows the underlying MPI implementation to introduce optimizations (which can also lead to the nondeterminism you have observed!).  The other reason to use collective operations, however, is to make your code more readable.
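
For the res summation specifically, a collective version could look roughly like the following (a sketch only, reusing the names and sizes from your code, and replacing the manual accumulation of res in the loop above):

! Every rank contributes its res_cpu; MPI sums them element-wise and
! leaves the result in res on rank 0 (res is significant only on the root).
call MPI_Reduce(res_cpu, res, total_unknowns, MPI_DOUBLE_PRECISION, &
                MPI_SUM, 0, MPI_COMM_WORLD, ierr)

Note that the reduction order inside MPI_Reduce is up to the implementation, so this buys speed and readability but not bitwise reproducibility.
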
But I cannot use them for the other 3 variables.
  
You can use an MPI_Gather operation to gather the data to rank 0 and then perform the summation on-node.  You need to decide (based on performance, readability, etc.) if you want to make that change.
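
For instance, gathering one of the integer arrays could look roughly like this (a sketch only, reusing row_cpu from the code above and assuming a hypothetical rank-0 buffer row_all dimensioned (total_elem_cpu*unique, 0:nsize-1), which is not in the original code):

! Every rank (including rank 0) contributes its row_cpu block; rank 0
! receives one block per rank, stored contiguously in row_all.
call MPI_Gather(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                row_all, total_elem_cpu*unique, MPI_INTEGER, &
                0, MPI_COMM_WORLD, ierr)

Rank 0 can then combine the gathered blocks in a fixed order of its choosing, which keeps the on-node summation reproducible from run to run.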

