Hi Junchao,

I am already using MPS, but thanks for the suggestion.
It does make a large difference indeed; I think it would make a very useful 
documentation entry in general.

Thank you,
Gabriele

________________________________
From: Junchao Zhang <[email protected]>
Sent: Tuesday, January 20, 2026 5:17 PM
To: Gabriele Penazzi <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] Performance with GPU and multiple MPI processes per 
GPU

Hello Gabriele,
  Maybe you can try the CUDA MPS service to map multiple MPI processes onto 
one GPU efficiently.  First, I would create a directory $HOME/tmp/nvidia-mps 
(by default, CUDA uses /tmp/nvidia-mps), then use these steps:

export CUDA_MPS_PIPE_DIRECTORY=$HOME/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp/nvidia-mps

# Start MPS
nvidia-cuda-mps-control -d

# run the test
mpiexec -n 16 ./test

# shut down MPS
echo quit | nvidia-cuda-mps-control
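
To check that the daemon is actually up before launching the test, something 
like the following should work (a rough check; get_server_list lists the PIDs 
of running MPS servers, and the list stays empty until the first CUDA client 
connects):

echo get_server_list | nvidia-cuda-mps-control
nvidia-smi   # once a client has connected, an nvidia-cuda-mps-server process should show up here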

I also like to block-map MPI processes to GPUs manually by setting the env var 
CUDA_VISIBLE_DEVICES.  For that I have this bash script set_gpu_device.sh on 
my PATH (assuming you use Open MPI):

#!/bin/bash
# Map this rank's node-local index to one GPU and expose only that GPU to it
GPUS_PER_NODE=2
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/GPUS_PER_NODE)))
exec "$@"   # launch the actual program; "$@" preserves argument quoting

Then, to run the test, I use

mpiexec -n 16 set_gpu_device.sh ./test
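
With GPUS_PER_NODE=2 and all 16 ranks on one node, OMPI_COMM_WORLD_LOCAL_SIZE 
is 16, so the script computes rank/(16/2) = rank/8: ranks 0-7 get 
CUDA_VISIBLE_DEVICES=0 and ranks 8-15 get CUDA_VISIBLE_DEVICES=1.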

Let us know if it helps so that we can add the instructions to the PETSc doc.

Thanks.
--Junchao Zhang


On Tue, Jan 20, 2026 at 8:21 AM Gabriele Penazzi via petsc-users 
<[email protected]<mailto:[email protected]>> wrote:
Hi.

I am using the PETSc conjugate gradient linear solver with GPU acceleration 
(CUDA), across multiple GPUs and multiple MPI processes.

I noticed that performance degrades significantly when using multiple MPI 
processes per GPU, compared to using a single process per GPU.
For example, 2 GPUs with 2 MPI processes will be about 40% faster than running 
the same calculation with 2 GPUs and 16 MPI processes.

I would assume the natural MPI/GPU affinity is 1-1; however, the rest of my 
application can benefit from multiple MPI processes driving the GPU via NVIDIA 
MPS. I am therefore trying to understand whether this is expected, whether I 
am possibly missing something in the initialization/setup, or whether my best 
choice is to constrain the MPI/GPU mapping to 1-1, especially for the PETSc 
linear solver step. I could not find explicit information about this in the 
manual.

Is there any user or maintainer who can tell me more about this use case?

Best Regards,
Gabriele Penazzi




