Wonderful, maybe, but extremely unportable. Thanks but no thanks!

George.
On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:

> Interesting idea, but doing MPI_THREAD_MULTIPLE has other side effects. If
> MPI nonblocking calls could take an extra stream argument and work like a
> kernel launch, it would be wonderful.
>
> --Junchao Zhang
>
> On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
>
>> Why not spawn num_threads, where num_threads is the number of kernels to
>> launch, and compile with the "--default-stream per-thread" option? Then
>> you could use MPI in thread-multiple mode to achieve your objective.
>>
>> Something like:
>>
>> void *launch_kernel(void *dummy)
>> {
>>     float *data;
>>     cudaMalloc(&data, N * sizeof(float));
>>
>>     /* With --default-stream per-thread, this launches on the calling
>>        thread's own default stream. */
>>     kernel<<<XX, YY>>>(data, N);
>>
>>     /* Synchronizes this thread's default stream only. */
>>     cudaStreamSynchronize(0);
>>
>>     MPI_Isend(data, ...);
>>     return NULL;
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     int provided;
>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>     const int num_threads = 8;
>>
>>     pthread_t threads[num_threads];
>>
>>     for (int i = 0; i < num_threads; i++) {
>>         if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
>>             fprintf(stderr, "Error creating thread\n");
>>             return 1;
>>         }
>>     }
>>
>>     for (int i = 0; i < num_threads; i++) {
>>         if (pthread_join(threads[i], NULL)) {
>>             fprintf(stderr, "Error joining thread\n");
>>             return 2;
>>         }
>>     }
>>     cudaDeviceReset();
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Zhang, Junchao via users
>> Sent: Wednesday, November 27, 2019 5:43 PM
>> To: George Bosilca <bosi...@icl.utk.edu>
>> Cc: Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <users@lists.open-mpi.org>
>> Subject: Re: [OMPI users] CUDA mpi question
>>
>> I was pointed to "2.7. Synchronization and Memory Ordering" of
>> https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but
>> unfortunately it is too short and I could not understand it.
>>
>> I also checked cudaStreamAddCallback/cudaLaunchHostFunc, whose
>> documentation says the host function "must not make any CUDA API calls".
>> I am not sure whether MPI_Isend qualifies as such a function.
>>
>> --Junchao Zhang
>>
>> On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
>>
>> Ideally, I want to formulate my code as a series of asynchronous
>> "kernel launch, kernel launch, ..." steps without synchronization, so
>> that I can hide kernel launch overhead. It now seems I have to sync
>> before MPI calls (even nonblocking ones).
>>
>> Then you need a means to ensure sequential execution, and this is what
>> streams provide. Unfortunately, I looked into the code and I'm afraid
>> there is currently no realistic way to do what you need.
>> My previous comment was based on older code that seems to be 1) currently
>> unmaintained, and 2) applicable only to the OB1 PML + OpenIB BTL combo.
>> As recent versions of Open MPI have moved away from the OpenIB BTL,
>> relying more heavily on UCX for InfiniBand support, the old code is now
>> deprecated. Sorry for giving you hope on this.
>>
>> Maybe you can delegate the MPI call into a CUDA event callback?
>>
>> George.
>>
>> Assuming you are willing to go for a less portable solution, you can get
>> the OMPI streams and add your kernels inside, so that the sequential
>> order will guarantee the correctness of your Isend. We have two hidden
>> CUDA streams in OMPI, one for device-to-host and one for host-to-device,
>> that can be queried with the non-MPI-standard-compliant functions
>> mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream.
>>
>> Which streams (dtoh or htod) should I use to insert the kernels that
>> produce send data and the kernels that consume received data? I imagine
>> MPI uses GPUDirect RDMA to move data directly from the GPU to the NIC.
>> Why do we need to bother with the dtoh or htod streams?
>>
>> George.
>>
>> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Hi,
>>
>> Suppose I have this piece of code and I use CUDA-aware MPI:
>>
>>     cudaMalloc(&sbuf, sz);
>>
>>     Kernel1<<<..., stream>>>(..., sbuf);
>>     MPI_Isend(sbuf, ...);
>>     Kernel2<<<..., stream>>>();
>>
>> Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>> make sure the data in sbuf is ready to send? If not, why not?
>>
>> Thank you.
>>
>> --Junchao Zhang
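
---

For reference, a minimal, self-contained sketch of the portable pattern George recommends above: synchronize the stream before handing the device buffer to MPI_Isend. It assumes a CUDA-aware Open MPI build and exactly two ranks; the kernel name "produce", the buffer size, and the launch dimensions are placeholders invented here, not anything from the thread.

/* sync_before_isend.cu -- sketch of "sync before the Isend".
 * Assumes a CUDA-aware MPI so device pointers can be passed to MPI. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;   /* stand-in for the real computation */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *sbuf;
    cudaMalloc(&sbuf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    if (rank == 0) {
        produce<<<(n + 255) / 256, 256, 0, stream>>>(sbuf, n);
        /* Without this sync, MPI_Isend may read sbuf before the kernel
         * has finished writing it and send garbage. */
        cudaStreamSynchronize(stream);
        MPI_Request req;
        MPI_Isend(sbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(sbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaStreamDestroy(stream);
    cudaFree(sbuf);
    MPI_Finalize();
    return 0;
}

The cost, as Junchao points out, is that the cudaStreamSynchronize breaks the fully asynchronous "kernel launch, kernel launch, ..." pipeline.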
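George's other suggestion, delegating the MPI call into a CUDA event callback, could look roughly like the sketch below using cudaLaunchHostFunc (CUDA 10 or later). Whether this is actually legal is exactly the open question raised in the thread: the CUDA documentation forbids CUDA API calls inside the host function, and a CUDA-aware MPI_Isend may make such calls internally. The names do_isend and send_args are invented for illustration; treat this as speculative, not a verified pattern.

/* hostfunc_isend.cu -- speculative sketch of delegating MPI_Isend to a
 * stream-ordered host callback. May violate the "no CUDA API calls in
 * the host function" rule if the CUDA-aware MPI calls CUDA internally. */
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;   /* stand-in for the real computation */
}

struct send_args {
    float      *buf;
    int         count;
    MPI_Request req;
};

/* Runs on a CUDA-internal host thread once all prior work on the stream
 * has completed. Because the MPI call now happens off the main thread,
 * the sketch requests MPI_THREAD_MULTIPLE to be safe. */
static void CUDART_CB do_isend(void *p)
{
    struct send_args *a = (struct send_args *)p;
    MPI_Isend(a->buf, a->count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &a->req);
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    struct send_args args = { NULL, n, MPI_REQUEST_NULL };
    cudaMalloc(&args.buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    if (rank == 0) {
        /* Kernel and send are enqueued back to back with no sync on the
         * critical path; further kernels on other buffers could follow. */
        produce<<<(n + 255) / 256, 256, 0, stream>>>(args.buf, n);
        cudaLaunchHostFunc(stream, do_isend, &args);
        cudaStreamSynchronize(stream);   /* do_isend has run by now */
        MPI_Wait(&args.req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(args.buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaStreamDestroy(stream);
    cudaFree(args.buf);
    MPI_Finalize();
    return 0;
}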