Hi Barry, yes, that's exactly the setup: multiple processes share a single physical GPU via MPS, and the GPUs are assigned upfront to guarantee a fair balance.
I've looked further into this, and the behavior seems to be related to the problem size in my application. When I increase the number of DOFs, I no longer observe any slowdown with multiple MPI processes per GPU.

I should also mention that I'm compiling PETSc without GPU-aware MPI. I know this is not recommended, so my results may not be fully representative; unfortunately, due to constraints in the toolchain I can use, this is the only way I can compile PETSc for the time being.

I can also reproduce the issue on a single GPU, but only for relatively small problems. For example, with about 2e6 DOFs, going from 4 to 8 MPI processes introduces a noticeable performance penalty on the GPU, while the same configuration still scales reasonably well on the CPU. I've attached the -log_view outputs for the 1-, 4-, and 8-process cases for this setup.

Since this degradation only shows up for smaller DOF counts, it sounds more like I'm misusing the library (or operating in a regime where overheads dominate). Based on this, my tentative conclusion is that, in general, using a communicator that maps one MPI process per GPU is a better approach (a rough sketch of this mapping is included after the quoted thread below). Would you consider that a fair statement?

Thanks,
Gabriele

________________________________
From: Barry Smith <[email protected]>
Sent: Tuesday, January 20, 2026 4:14 PM
To: Gabriele Penazzi <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] Performance with GPU and multiple MPI processes per GPU

Let me try to understand your setup. You have two physical GPUs and a CPU with at least 16 physical cores? You run with 16 MPI processes, each using its own "virtual" GPU (via MPS). Thus, a single physical GPU is shared by 8 MPI processes?

What happens if you run with 4 MPI processes, compared with 2? Can you run with -log_view and send the output when using 2, 4, and 8 MPI processes?

Barry

On Jan 19, 2026, at 5:52 AM, Gabriele Penazzi via petsc-users <[email protected]> wrote:

Hi,

I am using the PETSc conjugate gradient linear solver with GPU acceleration (CUDA), on multiple GPUs and multiple MPI processes. I noticed that performance degrades significantly when using multiple MPI processes per GPU, compared to using a single process per GPU. For example, 2 GPUs with 2 MPI processes are about 40% faster than running the same calculation with 2 GPUs and 16 MPI processes.

I would assume the natural MPI/GPU affinity is 1-1; however, the rest of my application benefits from multiple MPI processes driving the GPU via NVIDIA MPS. I am therefore trying to understand whether this is expected, whether I am possibly missing something in the initialization/setup, or whether my best choice is to constrain MPI/GPU access to 1-1, especially for the PETSc linear solver step. I could not find explicit information about this in the manual. Is there any user or maintainer who can tell me more about this use case?

Best Regards,
Gabriele Penazzi
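Regarding the one-MPI-process-per-GPU mapping discussed above, a minimal sketch of what that could look like follows. It splits the world communicator so that only the first ngpu ranks on each node form a "solver" communicator, and the KSP is created on that sub-communicator. The -ngpu option is a hypothetical illustration (it is not a PETSc option), and the application-specific redistribution of the matrix and right-hand side onto the solver ranks is omitted.

/* Sketch: restrict the PETSc solve to one MPI rank per GPU.
 * The -ngpu option (GPUs per node) is a hypothetical application option. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  MPI_Comm    nodecomm, solvercomm = MPI_COMM_NULL;
  PetscMPIInt wrank, lrank;
  PetscInt    ngpu = 1; /* GPUs per node; set via the illustrative -ngpu option */
  int         color;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCallMPI(MPI_Comm_rank(MPI_COMM_WORLD, &wrank));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-ngpu", &ngpu, NULL));

  /* Group the ranks that share a node, then keep only the first ngpu of them */
  PetscCallMPI(MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm));
  PetscCallMPI(MPI_Comm_rank(nodecomm, &lrank));
  color = (lrank < ngpu) ? 0 : MPI_UNDEFINED;
  PetscCallMPI(MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &solvercomm));

  if (solvercomm != MPI_COMM_NULL) {
    /* Only these ranks build the linear system and call KSPSolve; the other
     * ranks would first hand their rows/entries over to them (that
     * redistribution is application specific and not shown here). */
    KSP ksp;
    PetscCall(KSPCreate(solvercomm, &ksp));
    PetscCall(KSPSetType(ksp, KSPCG));
    PetscCall(KSPSetFromOptions(ksp));
    /* ... KSPSetOperators / KSPSolve with GPU matrix and vector types ... */
    PetscCall(KSPDestroy(&ksp));
    PetscCallMPI(MPI_Comm_free(&solvercomm));
  }
  PetscCallMPI(MPI_Comm_free(&nodecomm));
  PetscCall(PetscFinalize());
  return 0;
}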
1proc_gpu.log
4proc_gpu.log
8proc_gpu.log
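For completeness, below is a minimal, self-contained example of the kind of GPU-backed CG solve discussed in the thread, assuming a PETSc build configured with CUDA. The 1D Laplacian stand-in operator and the -n option are illustrative only; the application's actual matrix assembly is not shown in the thread. Run it, for example, as: mpiexec -n 4 ./app -n 2000000 -mat_type aijcusparse -ksp_type cg -log_view

/* Illustrative CG solve that runs on the GPU when -mat_type aijcusparse is given;
 * the vectors created from the matrix inherit the GPU type. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt n = 100, Istart, Iend, i;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-n", &n, NULL));

  /* 1D Laplacian as a stand-in for the application's SPD operator */
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
  PetscCall(MatSetFromOptions(A)); /* -mat_type aijcusparse places it on the GPU */
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) {
    if (i > 0)     PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
    if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
    PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatCreateVecs(A, &x, &b)); /* vectors match the matrix type */
  PetscCall(VecSet(b, 1.0));

  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));

  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}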
