Hi.

I am using PETSc's conjugate gradient linear solver with GPU acceleration (CUDA)
on multiple GPUs and multiple MPI processes.
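
To make the setup concrete, below is a minimal sketch of the kind of configuration I mean (my simplification: the matrix is a toy 1D Laplacian rather than my actual operator, and the GPU types could equally be requested at run time with -mat_type aijcusparse / -vec_type cuda):

/* Minimal sketch of the solver setup I mean (toy 1D Laplacian for illustration,
 * not my actual operator). */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 100, Istart, Iend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* GPU matrix type; could also be selected at run time with -mat_type aijcusparse */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetType(A, MATAIJCUSPARSE));
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  /* Vectors inherit the CUDA type from the matrix (or -vec_type cuda) */
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPCG));   /* conjugate gradient */
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

I launch it with, e.g., mpirun -np 2 versus mpirun -np 16 on the same 2 GPUs.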

I noticed that performance degrades significantly when using multiple MPI
processes per GPU, compared to using a single process per GPU.
For example, 2 GPUs with 2 MPI processes is about 40% faster than running
the same calculation with 2 GPUs and 16 MPI processes.

I would assume the natural MPI/GPU affinity is 1-1; however, the rest of my
application benefits from multiple MPI processes driving each GPU via NVIDIA
MPS. I am therefore trying to understand whether this is expected, whether I
am missing something in the initialization/setup, or whether my best choice
is to constrain MPI/GPU access to 1-1, especially for the PETSc linear solver
step. I could not find explicit information about this in the manual.
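
To clarify what I mean by several processes per GPU, the ranks on a node share the devices roughly as in the hypothetical sketch below (illustrative only, not my actual code; as far as I understand PETSc may also do its own device selection, which is part of what I am unsure about):

/* Hypothetical sketch: mapping local MPI ranks onto the node's GPUs before
 * PetscInitialize(). With NVIDIA MPS enabled, several ranks then share the
 * same device, e.g. 16 ranks over 2 GPUs = 8 ranks per GPU. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm node_comm;
  int      local_rank, ndev;

  MPI_Init(&argc, &argv);

  /* Ranks on the same node share its GPUs; find this rank's index on the node */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);

  cudaGetDeviceCount(&ndev);
  cudaSetDevice(local_rank % ndev);  /* round-robin assignment of ranks to devices */

  /* PETSc picks up the already-initialized MPI */
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* ... assemble and run the CG solve as usual ... */
  PetscCall(PetscFinalize());

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}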

Is there any user or maintainer who can tell me more about this use case?

Best Regards,
Gabriele Penazzi
