Re: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

2014-08-26 Thread Rolf vandeVaart
Hi Christoph:
I will try to reproduce this issue and will let you know what I find.  There
may be an issue with CUDA IPC support with certain traffic patterns.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

Hey all,

to test the performance of my application, I call the function that launches the
computation on the two GPUs five times in a row. During the 4th and 5th run,
however, the algorithm yields different results (9 clusters instead of 20):

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both compiled with
CUDA awareness. The CUDA Toolkit version is 6.0.
Both GPUs are attached to the same CPU, so CUDA IPC can be used
(no QPI link has to be traversed).
Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose
100" shows that IPC is in use.

I tracked my problem down to an MPI_Allgather that seems not to work: the
first GPU identifies 9 clusters, the second GPU identifies 11 clusters
(20 clusters in total). Debugging the application shows that all clusters
are identified correctly; however, the exchange of the identified clusters
seems not to work. Each MPI process stores its identified clusters in a
buffer that both processes exchange using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);
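
With MPI_IN_PLACE the send arguments are ignored and each rank's contribution
is taken from the segment of the receive buffer it would otherwise receive
into; the layout I am assuming here:

/* dec holds communicator.size() * columns doubles:
 *   [ rank 0's columns values | rank 1's columns values ]
 * Before the call each rank has filled only its own segment;
 * afterwards both segments should be valid on both ranks. */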

I later discovered that if I introduce a temporary host buffer that receives
the results from both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather(d_dec + columns*comm.rank(), columns, MPI_DOUBLE,
              &h_dec[0], columns, MPI_DOUBLE, communicator);
dec = h_dec; // copy results back from host to device
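
The same staging can also be written without Thrust (a sketch, assuming the
same d_dec, columns and communicator as above, plus <vector> and the CUDA
runtime headers):

std::vector<double> h_dec(dec.size());
// each rank sends its own device segment; the gathered result lands on the host
MPI_Allgather(d_dec + columns*comm.rank(), columns, MPI_DOUBLE,
              h_dec.data(), columns, MPI_DOUBLE, communicator);
// copy the complete result back to device memory
cudaMemcpy(d_dec, h_dec.data(), h_dec.size() * sizeof(double),
           cudaMemcpyHostToDevice);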

This led me to the conclusion that something in Open MPI's CUDA IPC support
causes the problem (a synchronisation and/or fail-silent error), and indeed,
disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 
-np 2 ./double_test ../data/similarities2.double.-300 
ex.2.double.2.gpus 1000 1000 0.9

will calculate correct results:

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20

Surprisingly, the wrong results _always_ occur during the 4th and 5th run. Is
there a way to force synchronisation (I tried MPI_Barrier() without success)?
Has anybody observed similar problems?
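
For reference, the device-side synchronisation I mean would look like this
(a sketch; as far as I understand, MPI_Barrier only orders the host processes
and says nothing about kernels still in flight on the GPU):

// drain all outstanding work on the device before MPI reads d_dec
cudaDeviceSynchronize();
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              d_dec, columns, MPI_DOUBLE, communicator);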

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph
