Hi Christoph:
I will try to reproduce this issue and will let you know what I find. There
may be an issue with CUDA IPC support under certain traffic patterns.
Rolf
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem
Hey all,
To test the performance of my application, I call the function that issues the
computation on two GPUs 5 times in a row. During the 4th and 5th runs, however,
the algorithm yields a different result (9 clusters instead of 20):
# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9
For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both built with
CUDA-aware support. The CUDA Toolkit version is 6.0.
Both GPUs are under the control of a single CPU, so CUDA IPC can be used
(because no QPI link has to be traversed).
Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose
100" shows that IPC is used.
I tracked the problem down to an MPI_Allgather that does not seem to work. The
first GPU identifies 9 clusters and the second GPU identifies 11 clusters
(20 clusters in total), but the failing runs report only 9. Debugging the
application shows that all clusters are identified correctly; it is the
exchange of the identified clusters that seems to fail. Each MPI process stores
its identified clusters in a buffer, and both processes exchange these buffers
using MPI_Allgather:
value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);
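For reference, with MPI_IN_PLACE the send count and send type are ignored, and each rank's own contribution must already sit at offset rank * columns of the receive buffer. The sketch below is a host-only model of that layout. It makes no MPI or CUDA calls, and the name allgather_in_place_model is made up for illustration; it is not part of the original code or of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of MPI_Allgather with MPI_IN_PLACE (hypothetical helper,
// no MPI involved). buffers[r] stands for the receive buffer of rank r and
// must hold ranks * columns elements. Before the call, rank r is expected to
// have written its own block at offset r * columns. After the call, every
// buffer holds the blocks of all ranks in rank order.
void allgather_in_place_model(std::vector<std::vector<double>>& buffers,
                              std::size_t columns) {
    const std::size_t ranks = buffers.size();
    for (std::size_t r = 0; r < ranks; ++r) {
        assert(buffers[r].size() == ranks * columns);
    }
    for (std::size_t src = 0; src < ranks; ++src) {
        for (std::size_t dst = 0; dst < ranks; ++dst) {
            if (dst == src) continue;  // a rank keeps its own block untouched
            for (std::size_t i = 0; i < columns; ++i) {
                buffers[dst][src * columns + i] = buffers[src][src * columns + i];
            }
        }
    }
}
```

With two ranks and columns = 3, the buffers {1, 2, 3, 0, 0, 0} and {0, 0, 0, 4, 5, 6} both become {1, 2, 3, 4, 5, 6}. If a rank's block is not at its own offset before the call, the gathered result is wrong on every rank.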
I later discovered that if I introduce a temporary host buffer to receive the
results of both GPUs, all results are computed correctly:
value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather(d_dec + columns * comm.rank(), columns, MPI_DOUBLE,
              h_dec.data(), columns, MPI_DOUBLE, communicator);
dec = h_dec; // copy results back from host to device
This led me to the conclusion that something in Open MPI's CUDA IPC support
seems to cause the problem (a synchronisation and/or fail-silent error).
Indeed, disabling CUDA IPC:
mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0
-np 2 ./double_test ../data/similarities2.double.-300
ex.2.double.2.gpus 1000 1000 0.9
produces correct results:
# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
Surprisingly, the wrong results _always_ occur during the 4th and 5th runs. Is
there a way to force synchronisation? (I tried MPI_Barrier() without success.)
Has anybody encountered similar problems?
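One way to make a silent failure visible, sketched under assumptions: fill the blocks owned by the other ranks with NaN before each exchange, then check a host copy of the buffer for leftover NaNs afterwards. Data left over from an earlier run can then no longer pass as freshly received data. The helpers poison_remote_blocks and first_unwritten_block are hypothetical names. The sketch works on a host-side std::vector and assumes that valid results never contain NaN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Overwrite every block except the one owned by `rank` with NaN.
// Assumes buf holds a whole number of blocks of `columns` elements each.
void poison_remote_blocks(std::vector<double>& buf, std::size_t rank,
                          std::size_t columns) {
    assert(columns > 0 && buf.size() % columns == 0);
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const std::size_t ranks = buf.size() / columns;
    for (std::size_t r = 0; r < ranks; ++r) {
        if (r == rank) continue;  // keep this rank's own contribution
        for (std::size_t i = 0; i < columns; ++i) {
            buf[r * columns + i] = nan;
        }
    }
}

// Return the index of the first block that still contains a NaN,
// or -1 if every block has been overwritten with real data.
long first_unwritten_block(const std::vector<double>& buf,
                           std::size_t columns) {
    assert(columns > 0 && buf.size() % columns == 0);
    const std::size_t ranks = buf.size() / columns;
    for (std::size_t r = 0; r < ranks; ++r) {
        for (std::size_t i = 0; i < columns; ++i) {
            if (std::isnan(buf[r * columns + i])) {
                return static_cast<long>(r);
            }
        }
    }
    return -1;
}
```

If first_unwritten_block returns a non-negative index after the exchange, that rank's block never arrived, even though the MPI call itself reported no error.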
I posted some of the code to pastebin: http://pastebin.com/wCmc36k5
Thanks in advance,
Christoph