Wonderful, maybe, but extremely unportable. Thanks, but no thanks!

  George.

On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:

> Interesting idea, but doing MPI_THREAD_MULTIPLE has other side effects. If
> MPI nonblocking calls could take an extra stream argument and work like a
> kernel launch, it would be wonderful.
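>
> A purely hypothetical sketch of what such an interface could look like
> (none of these names exist in MPI or in Open MPI; this only illustrates the
> idea of enqueueing the send on a stream, like a kernel launch):
>
>     /* Hypothetical API, for illustration only. */
>     MPI_Request req;
>     MPIX_Isend_on_stream(sbuf, count, MPI_FLOAT, dest, tag,
>                          MPI_COMM_WORLD, stream, &req);
>     /* Returns immediately; the transfer would start only after all prior
>        work queued on `stream` (e.g. the producing kernel) completes. */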
> --Junchao Zhang
>
>
> On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
>
>> Why not spawn num_threads threads, where num_threads is the number of
>> kernels to launch, and compile with the “--default-stream per-thread”
>> option?
>>
>>
>>
>> Then you could use MPI in thread multiple mode to achieve your objective.
>>
>>
>>
>> Something like:
>>
>>
>> #include <stdio.h>
>> #include <pthread.h>
>> #include <mpi.h>
>> #include <cuda_runtime.h>
>>
>> /* N and the launch configuration XX, YY are placeholders, as in the
>>    original sketch; substitute your real kernel and sizes. */
>>
>> void *launch_kernel(void *dummy)
>> {
>>     float *data;
>>     cudaMalloc(&data, N * sizeof(float));
>>
>>     /* Compiled with --default-stream per-thread, this launch goes to
>>        this thread's own default stream. */
>>     kernel<<<XX, YY>>>(data, N);
>>
>>     /* Synchronize this thread's default stream so data is ready to send. */
>>     cudaStreamSynchronize(0);
>>
>>     MPI_Isend(data, ...);   /* remaining arguments elided in the original */
>>
>>     return NULL;
>> }
>>
>> int main(int argc, char **argv)
>> {
>>     int provided;
>>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>
>>     const int num_threads = 8;
>>     pthread_t threads[num_threads];
>>
>>     for (int i = 0; i < num_threads; i++) {
>>         if (pthread_create(&threads[i], NULL, launch_kernel, NULL)) {
>>             fprintf(stderr, "Error creating thread\n");
>>             return 1;
>>         }
>>     }
>>
>>     for (int i = 0; i < num_threads; i++) {
>>         if (pthread_join(threads[i], NULL)) {
>>             fprintf(stderr, "Error joining thread\n");
>>             return 2;
>>         }
>>     }
>>
>>     cudaDeviceReset();
>>     MPI_Finalize();
>> }
>>
>> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of *Zhang,
>> Junchao via users
>> *Sent:* Wednesday, November 27, 2019 5:43 PM
>> *To:* George Bosilca <bosi...@icl.utk.edu>
>> *Cc:* Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <
>> users@lists.open-mpi.org>
>> *Subject:* Re: [OMPI users] CUDA mpi question
>>
>>
>>
>> I was pointed to "2.7. Synchronization and Memory Ordering" in
>> https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but
>> unfortunately it is too short and I could not understand it.
>>
>> I also checked cudaStreamAddCallback/cudaLaunchHostFunc, which say the
>> host function "must not make any CUDA API calls". I am not sure whether
>> MPI_Isend qualifies as such a function.
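>>
>> For illustration, a minimal sketch of that delegation (names like send_ctx
>> and isend_host_fn are made up here; whether a CUDA-aware MPI_Isend issues
>> CUDA API calls internally, which would break the restriction above, is
>> exactly the open question):
>>
>>     struct send_ctx { float *sbuf; int count, dest, tag; MPI_Request req; };
>>
>>     /* Called by the CUDA runtime once all prior work on the stream is done. */
>>     static void isend_host_fn(void *arg)
>>     {
>>         struct send_ctx *c = (struct send_ctx *)arg;
>>         MPI_Isend(c->sbuf, c->count, MPI_FLOAT, c->dest, c->tag,
>>                   MPI_COMM_WORLD, &c->req);
>>     }
>>
>>     /* ... on the stream that produces sbuf: */
>>     Kernel1<<<grid, block, 0, stream>>>(..., sbuf);
>>     cudaLaunchHostFunc(stream, isend_host_fn, &ctx);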
>>
>> --Junchao Zhang
>>
>> On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>> On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov>
>> wrote:
>>
>> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu>
>> wrote:
>>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
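>>
>> Concretely, for the snippet in the original question further down, the
>> portable ordering is simply (counts, types, and launch parameters left as
>> placeholders, as in the original):
>>
>>     cudaMalloc(&sbuf, sz);
>>     Kernel1<<<..., stream>>>(..., sbuf);
>>     cudaStreamSynchronize(stream);   /* ensure Kernel1 finished writing sbuf */
>>     MPI_Isend(sbuf, ...);
>>     Kernel2<<<..., stream>>>();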
>>
>> Ideally, I want to formulate my code as a series of asynchronous
>> "kernel launch, kernel launch, ..." calls without synchronization, so that
>> I can hide kernel launch overhead. It now seems I have to sync before MPI
>> calls (even nonblocking ones).
>>
>>
>>
>> Then you need a means to ensure sequential execution, and this is what
>> the streams provide. Unfortunately, I looked into the code and I'm afraid
>> there is currently no realistic way to do what you need. My previous
>> comment was based on older code that seems to be 1) currently unmaintained
>> and 2) only applicable to the OB1 PML + OpenIB BTL combo. As recent
>> versions of OMPI have moved away from the OpenIB BTL, relying more heavily
>> on UCX for InfiniBand support, that old code is now deprecated. Sorry for
>> giving you hope on this.
>>
>>
>>
>> Maybe you can delegate the MPI call to a CUDA event callback?
>>
>>
>>
>>   George.
>>
>> Assuming you are willing to go for a less portable solution, you can get
>> the OMPI streams and add your kernels to them, so that the sequential order
>> will guarantee the correctness of your Isend. We have 2 hidden CUDA streams
>> in OMPI, one for device-to-host and one for host-to-device, that can be
>> queried with the non-MPI-standard functions
>> (mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).
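>>
>> A sketch of that less portable route (the prototype of the internal
>> function is assumed here; check the Open MPI source for the real
>> declaration and header before relying on it):
>>
>>     /* Internal, non-MPI-standard Open MPI symbol; declaration assumed. */
>>     extern void *mca_common_cuda_get_dtoh_stream(void);
>>
>>     cudaStream_t s = (cudaStream_t)mca_common_cuda_get_dtoh_stream();
>>     Kernel1<<<grid, block, 0, s>>>(..., sbuf);  /* produce the send buffer */
>>     MPI_Isend(sbuf, ...);  /* OMPI's staging on this stream runs after Kernel1 */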
>>
>>
>>
>> Which streams (dtoh or htod) should I use to insert the kernels producing
>> send data and the kernels consuming received data? I imagine MPI uses
>> GPUDirect RDMA to move data directly from the GPU to the NIC. Why do we
>> need to bother with the dtoh or htod streams?
>>
>>
>>
>> George.
>>
>> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
>> users@lists.open-mpi.org> wrote:
>>
>> Hi,
>>
>>   Suppose I have this piece of code and I use CUDA-aware MPI:
>>
>>     cudaMalloc(&sbuf, sz);
>>     Kernel1<<<..., stream>>>(..., sbuf);
>>     MPI_Isend(sbuf, ...);
>>     Kernel2<<<..., stream>>>();
>>
>>
>>
>>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>> make sure data in sbuf is ready to send?  If not, why?
>>
>>
>>
>>   Thank you.
>>
>>
>>
>> --Junchao Zhang
>>
>>
