On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:

> Short and portable answer: you need to sync before the Isend or you will
> send garbage data.

Ideally, I want to formulate my code as a series of asynchronous launches ("kernel launch, kernel launch, ...") without synchronization, so that I can hide the kernel launch overhead. It now seems I have to sync before MPI calls (even nonblocking ones).
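The portable pattern George recommends might look like the sketch below (sizes, tags, grid/block dimensions, and kernel arguments are illustrative placeholders, not from the original code):

```cuda
// Portable approach: synchronize the stream before handing sbuf to MPI.
cudaMalloc(&sbuf, sz);
Kernel1<<<grid, block, 0, stream>>>(sbuf);

// MPI_Isend only promises that the *CPU* call returns immediately; it does
// not order itself against work still queued on the CUDA stream, so make
// sure Kernel1 has finished writing sbuf before MPI may read it.
cudaStreamSynchronize(stream);
MPI_Isend(sbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);

// Kernel2 does not touch sbuf, so it can be launched right away and
// overlap with the communication.
Kernel2<<<grid, block, 0, stream>>>();

MPI_Wait(&req, MPI_STATUS_IGNORE);
```

The cost of this pattern is exactly the concern raised above: the cudaStreamSynchronize stalls the host until the kernel completes, so launch overhead for Kernel2 is no longer hidden behind Kernel1.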
> Assuming you are willing to go for a less portable solution, you can get
> the OMPI streams and add your kernels to them, so that the sequential
> ordering will guarantee the correctness of your Isend. We have two hidden
> CUDA streams in OMPI, one for device-to-host and one for host-to-device
> transfers, which can be queried with the (non-MPI-standard-compliant)
> functions mca_common_cuda_get_dtoh_stream and
> mca_common_cuda_get_htod_stream.

Which stream (dtoh or htod) should I use to insert kernels that produce send data, and kernels that consume received data? I imagined MPI uses GPUDirect RDMA to move data directly from the GPU to the NIC. Why do we need to bother with the dtoh or htod streams?

>   George.
>
> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users
> <users@lists.open-mpi.org> wrote:
>
>> Hi,
>>   Suppose I have this piece of code and I use CUDA-aware MPI:
>>
>>     cudaMalloc(&sbuf, sz);
>>     Kernel1<<<..., stream>>>(..., sbuf);
>>     MPI_Isend(sbuf, ...);
>>     Kernel2<<<..., stream>>>();
>>
>>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>>   make sure the data in sbuf is ready to send? If not, why?
>>
>>   Thank you.
>>   --Junchao Zhang
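For reference, the less portable approach George describes might be sketched as follows. Only the two function names come from this thread; their return type, the extern declarations, the kernel names, and all MPI arguments below are assumptions for illustration, and these are Open MPI internals whose availability and semantics may change between releases:

```cuda
// ASSUMED interface: the actual declarations live in Open MPI's internal
// headers; the return type shown here is a guess.
extern "C" cudaStream_t mca_common_cuda_get_dtoh_stream(void);
extern "C" cudaStream_t mca_common_cuda_get_htod_stream(void);

// Send side: launch the kernel that produces sbuf on the device-to-host
// stream, so stream ordering serializes it before any internal work OMPI
// queues on that stream for the send.
cudaStream_t dtoh = mca_common_cuda_get_dtoh_stream();
ProducerKernel<<<grid, block, 0, dtoh>>>(sbuf);   // hypothetical kernel
MPI_Isend(sbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &sreq);

// Receive side: a kernel consuming received data would go on the
// host-to-device stream, after the receive has completed.
cudaStream_t htod = mca_common_cuda_get_htod_stream();
MPI_Irecv(rbuf, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &rreq);
MPI_Wait(&rreq, MPI_STATUS_IGNORE);
ConsumerKernel<<<grid, block, 0, htod>>>(rbuf);   // hypothetical kernel

MPI_Wait(&sreq, MPI_STATUS_IGNORE);
```

Note this relies on Open MPI actually staging the transfer through the queried stream; as the question above points out, it is unclear whether the same ordering holds when the data path is GPUDirect RDMA rather than a dtoh/htod copy.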