Hi. I am using the PETSc conjugate gradient (CG) linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes.
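To make the setup concrete, below is a minimal sketch of the kind of configuration I mean, not my actual code: a 1D Laplacian stands in for my real operator, and I assume a recent PETSc where PetscCall is available. Preconditioner and tolerances are left to KSPSetFromOptions.

#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, Istart, Iend, n = 1000000;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* CUDA-backed matrix so the MatMult in each CG iteration runs on the GPU */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetType(A, MATAIJCUSPARSE));
  PetscCall(MatSetUp(A));

  /* Simple 1D Laplacian as a stand-in for the real operator */
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0) PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  /* Vectors created from the matrix inherit the CUDA type */
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));

  /* CG solve; preconditioner and tolerances come from the command line */
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

The matrix and vector types could equally be selected at run time with -mat_type aijcusparse -vec_type cuda. When I oversubscribe a GPU with several ranks, MPS is started beforehand with nvidia-cuda-mps-control -d, and -log_view shows where the time goes per rank/GPU.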
I noticed that performance degrades significantly when using multiple MPI processes per GPU compared to using a single process per GPU. For example, running with 2 GPUs and 2 MPI processes is about 40% faster than running the same calculation with 2 GPUs and 16 MPI processes.

I would have assumed the natural MPI/GPU affinity is 1:1, but the rest of my application benefits from having multiple MPI processes drive each GPU via NVIDIA MPS. I am therefore trying to understand whether this slowdown is expected, whether I am missing something in the initialization/setup, or whether my best option is to constrain the mapping to one MPI process per GPU, at least for the PETSc linear solver step. I could not find explicit information about this use case in the manual.

Is there any user or maintainer who can tell me more about it?

Best Regards,
Gabriele Penazzi
