Hi Christoph:

I will try and reproduce this issue and will let you know what I find. There may be an issue with CUDA IPC support with certain traffic patterns.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

Hey all,

to test the performance of my application, I call the function that issues the computation on two GPUs 5 times in a row. During the 4th and 5th run of the algorithm, however, the algorithm yields different results (9 clusters instead of 20):

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both compiled with CUDA awareness. The CUDA Toolkit version is 6.0. Both GPUs are under the control of one single CPU, so that CUDA IPC can be used (because no QPI link has to be traversed). Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose 100" shows that IPC is used.

I tracked my problem down to an MPI_Allgather, which seems not to work: the first GPU identifies 9 clusters and the second GPU identifies 11 clusters (20 clusters in total), so the wrong result of 9 matches the first GPU's share alone.

Debugging the application shows that all clusters are identified correctly; however, the exchange of the identified clusters seems not to work. Each MPI process stores its identified clusters in a buffer, and both processes exchange these buffers using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather( MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, d_dec, columns, MPI_DOUBLE, communicator);

I later discovered that if I introduce a temporary host buffer that receives the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather( d_dec+columns*comm.rank(), columns, MPI_DOUBLE, thrust::raw_pointer_cast(&h_dec[0]), columns, MPI_DOUBLE, communicator);
dec = h_dec; //copy results back from host to device

This led me to the conclusion that something in Open MPI's CUDA IPC seems to cause the problem (a synchronisation and/or fail-silent error). Indeed, disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 -np 2 ./double_test ../data/similarities20000.double.-300 ex.20000.double.2.gpus 1000 1000 0.9

yields correct results:

# datatype: double
# datapoints: 20000
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20

Surprisingly, the wrong results _always_ occur during the 4th and 5th run.

Is there a way to force synchronisation? (I tried MPI_Barrier() without success.) Has anybody encountered similar problems?

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph
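
For anyone trying to narrow this down: the buffer layout that the in-place MPI_Allgather in the message relies on can be modelled on the host, without MPI or CUDA. The following is only a sketch under stated assumptions, not the code from the pastebin. `allgather_in_place_model` and `first_mismatch` are hypothetical helper names, not Open MPI or Thrust functions. With MPI_IN_PLACE, each rank's contribution is read from its own block of the receive buffer (offset rank * columns), and after the call every rank should hold the same concatenated buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only model of MPI_Allgather with MPI_IN_PLACE.
// buffers[r] stands in for rank r's receive buffer of ranks * columns values;
// rank r's own contribution is read from block r of that buffer.
void allgather_in_place_model(std::vector<std::vector<double>>& buffers,
                              std::size_t columns) {
    const std::size_t ranks = buffers.size();
    std::vector<double> gathered(ranks * columns);
    // Collect each rank's own block before any buffer is overwritten.
    for (std::size_t r = 0; r < ranks; ++r) {
        assert(buffers[r].size() == ranks * columns);
        for (std::size_t c = 0; c < columns; ++c)
            gathered[r * columns + c] = buffers[r][r * columns + c];
    }
    // Every rank ends up with the same concatenation of all blocks.
    for (std::size_t r = 0; r < ranks; ++r)
        buffers[r] = gathered;
}

// Index of the first element where a and b differ, or -1 if they are identical.
// If one buffer is a strict prefix of the other, the shorter length is returned.
long first_mismatch(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] != b[i]) return static_cast<long>(i);
    return a.size() == b.size() ? -1 : static_cast<long>(n);
}
```

In the real application, one could copy `dec` back to the host after the CUDA IPC exchange and after the host-staged exchange, then pass both copies to a comparison like `first_mismatch`. A mismatch index at or beyond `columns` on rank 0 would point at the remote block not having arrived, which would fit the count of 9 seen in the failing runs.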