Let me try to understand your setup. 

You have two physical GPUs and a CPU with at least 16 physical cores? 

You run with 16 MPI processes, each using its own "virtual" GPU (via MPS), 
so a single physical GPU is shared by 8 MPI processes?

What happens if you run with 4 MPI processes, compared with 2? 

Can you run with -log_view and send the output when using 2, 4, and 8 MPI 
processes?  
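
For concreteness, something like the following is what I have in mind 
("yourapp" and mpiexec are just placeholders for your executable and MPI 
launcher; -log_view makes PETSc print a performance summary at the end of 
the run):

   mpiexec -n 2 ./yourapp -log_view
   mpiexec -n 4 ./yourapp -log_view
   mpiexec -n 8 ./yourapp -log_view

Comparing the per-event times across the three summaries should show where 
the additional processes per GPU are costing time.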

Barry


> On Jan 19, 2026, at 5:52 AM, Gabriele Penazzi via petsc-users 
> <[email protected]> wrote:
> 
> Hi.
> 
> I am using the PETSc conjugate gradient linear solver with GPU acceleration 
> (CUDA), on multiple GPUs and multiple MPI processes.
> 
> I noticed that performance degrades significantly when using multiple MPI 
> processes per GPU, compared to using a single process per GPU.
> For example, running on 2 GPUs with 2 MPI processes is about 40% faster than 
> running the same calculation on 2 GPUs with 16 MPI processes.
> 
> I would assume the natural MPI/GPU affinity is 1-1; however, the rest of my 
> application benefits from multiple MPI processes driving the GPU via NVIDIA 
> MPS. I am therefore trying to understand whether this is expected, whether I 
> am missing something in the initialization/setup, or whether my best choice 
> is to constrain MPI/GPU access to 1-1, especially for the PETSc linear solver 
> step. I could not find explicit information about this in the manual.
> 
> Is there any user or maintainer who can tell me more about this use case?
>  
> Best Regards,
> Gabriele Penazzi
