Hi Christoph:
I will try to reproduce this issue and let you know what I find.  There 
may be an issue with CUDA IPC support under certain traffic patterns.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

Hey all,

to test the performance of my application, I call the function that launches 
the computation on the two GPUs five times in a row. During the 4th and 5th 
run, however, the algorithm yields a different result (9 clusters instead of 
20):

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both compiled with 
CUDA awareness. The CUDA Toolkit version is 6.0.
Both GPUs are attached to one single CPU, so CUDA IPC can be used (no QPI link 
has to be traversed).
Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose 
100" shows that IPC is indeed used.

I tracked the problem down to an MPI_Allgather that seems not to work: the 
first GPU identifies 9 clusters and the second GPU identifies 11 clusters 
(20 clusters in total). Debugging the application shows that all clusters are 
identified correctly; however, the exchange of the identified clusters seems 
not to work. Each MPI process stores its identified clusters in a buffer, and 
both processes exchange these buffers using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
// in place: every rank contributes its own block of d_dec and receives all blocks
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);

I later discovered that if I introduce a temporary host buffer that receives 
the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
// send this rank's block from the device buffer, receive all blocks into the host buffer
MPI_Allgather(d_dec + columns * comm.rank(), columns, MPI_DOUBLE,
              &h_dec[0], columns, MPI_DOUBLE, communicator);
dec = h_dec; // copy results back from host to device

This led me to the conclusion that something in Open MPI's CUDA IPC support 
causes the problem (a synchronisation and/or fail-silent error), and indeed, 
disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 
-np 2 ./double_test ../data/similarities20000.double.-300 
ex.20000.double.2.gpus 1000 1000 0.9

will calculate correct results:

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20

Surprisingly, the wrong results _always_ occur during the 4th and 5th run. Is 
there a way to force synchronisation (I tried MPI_Barrier() without success)? 
Has anybody observed similar problems?
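
For illustration only, the kind of explicit synchronisation I have in mind 
would look roughly like the sketch below; putting a cudaDeviceSynchronize() 
right before the exchange is my assumption about where a device-side sync 
might be needed, not something I know to be required:

cudaDeviceSynchronize();   // assumption: make sure all kernels writing dec have finished
MPI_Barrier(communicator); // I already tried a barrier like this, without success
value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);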

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph

