On Wed, Nov 27, 2019 at 3:16 PM George Bosilca 
<bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code as a series of asynchronous launches 
("kernel launch, kernel launch, ...") without synchronization, so that I can 
hide kernel launch overhead. It now seems I have to sync before MPI calls 
(even nonblocking ones).
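For concreteness, a minimal sketch of the portable pattern being discussed, 
with the synchronization added before the Isend. The names grid, block, count, 
dest, tag, and req are placeholders not taken from the original code, and 
MPI_DOUBLE is likewise illustrative; error checking is omitted:

    double *sbuf;                        /* device buffer, as in the original */
    cudaMalloc((void **)&sbuf, sz);
    Kernel1<<<grid, block, 0, stream>>>(sbuf);

    /* Portable approach: drain the stream so sbuf is fully written
       before MPI is allowed to start reading it. */
    cudaStreamSynchronize(stream);

    MPI_Request req;
    MPI_Isend(sbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);

    /* Independent work can be launched right away; just do not reuse
       sbuf until the send completes. */
    Kernel2<<<grid, block, 0, stream>>>();
    MPI_Wait(&req, MPI_STATUS_IGNORE);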


Assuming you are willing to go for a less portable solution, you can get the 
OMPI streams and add your kernels to them, so that the sequential ordering 
within a stream will guarantee the correctness of your Isend. We have 2 hidden 
CUDA streams in OMPI, one for device-to-host and one for host-to-device 
transfers, which can be queried with the non-MPI-standard functions 
(mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).

Which stream (dtoh or htod) should I use for kernels that produce send data, 
and which for kernels that consume received data? I imagine MPI uses GPUDirect 
RDMA to move data directly from the GPU to the NIC, so why do we need to 
bother with the dtoh or htod streams at all?

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users 
<users@lists.open-mpi.org> wrote:
Hi,
  Suppose I have this piece of code and I use CUDA-aware MPI:
   cudaMalloc(&sbuf, sz);
   Kernel1<<<..., stream>>>(..., sbuf);   /* produces the data in sbuf */
   MPI_Isend(sbuf, ...);
   Kernel2<<<..., stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure the data in sbuf is ready to send? If not, why?

  Thank you.

--Junchao Zhang
