Hi Barry,

Yes, that's exactly the setup: multiple processes share a single physical GPU 
via MPS, and the GPUs are assigned upfront to guarantee a fair balance.
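
For reference, the upfront assignment is essentially a round-robin mapping of 
ranks to devices, along the lines of the sketch below (a minimal MPI/CUDA 
illustration, not my actual code):

/* Sketch: round-robin assignment of MPI ranks to physical GPUs,
   done before any solver work. Illustrative only. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, ndev;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  cudaGetDeviceCount(&ndev);        /* e.g. 2 physical GPUs */
  cudaSetDevice(rank % ndev);       /* with 16 ranks, 8 ranks share each GPU via MPS */
  printf("rank %d -> device %d\n", rank, rank % ndev);
  MPI_Finalize();
  return 0;
}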

I’ve looked further into this, and the behavior seems to be related to the 
problem size in my application. When I increase the number of DOFs, I no longer 
observe any slowdown with multiple MPI processes per GPU.

I should also mention that I’m compiling PETSc without GPU‑aware MPI. I know 
this is not recommended, so my results may not be fully representative. 
Unfortunately, due to constraints in the toolchain I can use, this is the only 
way I can compile PETSc for the time being.

I can also reproduce the issue on a single GPU, but only for relatively small 
problems. For example, with about 2e6 DOFs, going from 4 to 8 MPI processes 
introduces a noticeable performance penalty on the GPU (while the same 
configuration still scales reasonably well on the CPU). I’ve attached the 
-log_view outputs for the 1‑, 4‑, and 8‑process cases for this setup.
Since this degradation only shows up for smaller DOF counts, it sounds more 
like I’m misusing the library (or operating in a regime where overheads 
dominate).

Based on this, my tentative conclusion is that, in general, using a 
communicator that maps one MPI process per GPU is a better approach. Would you 
consider that a fair statement?
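
Concretely, by "a communicator that maps one MPI process per GPU" I mean 
something like the sketch below (minimal MPI/CUDA sketch, not my actual code; 
it assumes ranks were assigned to devices round-robin as above):

/* Sketch: sub-communicator with one MPI rank per GPU for the solve phase.
   Illustrative only. Returns a communicator containing one rank per physical
   GPU on this node; other ranks get MPI_COMM_NULL and sit out the solve. */
#include <mpi.h>
#include <cuda_runtime.h>

MPI_Comm one_rank_per_gpu_comm(MPI_Comm comm)
{
  MPI_Comm node, solver;
  int node_rank, ndev;
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
  MPI_Comm_rank(node, &node_rank);
  cudaGetDeviceCount(&ndev);
  /* Keep the first ndev ranks on each node; everyone else is excluded. */
  MPI_Comm_split(comm, node_rank < ndev ? 0 : MPI_UNDEFINED, 0, &solver);
  MPI_Comm_free(&node);
  return solver;   /* MPI_COMM_NULL on excluded ranks */
}
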
Thanks,
Gabriele



________________________________
From: Barry Smith <[email protected]>
Sent: Tuesday, January 20, 2026 4:14 PM
To: Gabriele Penazzi <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] Performance with GPU and multiple MPI processes per 
GPU

 Let me try to understand your setup.

You have two physical GPUs and a CPU with at least 16 physical cores?

You run with 16 MPI processes, each using its own "virtual" GPU (via MPS). 
Thus, a single physical GPU is shared by 8 MPI processes?

What happens if you run with 4 MPI processes, compared with 2?

Can you run with -log_view and send the output when using 2, 4, and 8 MPI 
processes?

Barry


On Jan 19, 2026, at 5:52 AM, Gabriele Penazzi via petsc-users 
<[email protected]> wrote:

Hi.

I am using the PETSc conjugate gradient linear solver with GPU acceleration 
(CUDA), on multiple GPUs and multiple MPI processes.
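
The solver setup is essentially the standard KSP path with CUDA matrix/vector 
types, roughly as in the sketch below (simplified, not my actual code; 
assembly and options differ in the application):

/* Minimal sketch of the setup: CG on GPU via CUDA matrix/vector types.
   Assembly omitted; the same types can also be selected with runtime options
   -mat_type aijcusparse -vec_type cuda -ksp_type cg. */
#include <petscksp.h>

PetscErrorCode solve_on_gpu(MPI_Comm comm, Mat A, Vec b, Vec x)
{
  KSP ksp;
  PetscFunctionBeginUser;
  PetscCall(KSPCreate(comm, &ksp));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPSetOperators(ksp, A, A));   /* A assembled as MATAIJCUSPARSE */
  PetscCall(KSPSetFromOptions(ksp));       /* picks up -ksp_type, tolerances, etc. */
  PetscCall(KSPSolve(ksp, b, x));          /* b, x created as VECCUDA */
  PetscCall(KSPDestroy(&ksp));
  PetscFunctionReturn(PETSC_SUCCESS);
}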

I noticed that performance degrades significantly when using multiple MPI 
processes per GPU, compared to using a single process per GPU.
For example, 2 GPUs with 2 MPI processes are about 40% faster than running 
the same calculation with 2 GPUs and 16 MPI processes.

I would assume the natural MPI/GPU affinity is 1-1. However, the rest of my 
application benefits from multiple MPI processes driving each GPU via NVIDIA 
MPS, so I am trying to understand whether this slowdown is expected, whether I 
am missing something in the initialization/setup, or whether my best choice is 
to constrain MPI/GPU access to 1-1, especially for the PETSc linear solver 
step. I could not find explicit information about this in the manual.

Is there any user or maintainer who can tell me more about this use case?

Best Regards,
Gabriele Penazzi

Attachment: 1proc_gpu.log
Attachment: 4proc_gpu.log
Attachment: 8proc_gpu.log
