A common approach is to use CUDA_VISIBLE_DEVICES to control the mapping of MPI ranks to GPUs; see the GPU nodes section at https://docs.nersc.gov/jobs/affinity/#gpu-nodes
With OpenMPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of SLURM_LOCALID (see https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html). For example, with 8 MPI ranks and 4 GPUs per node, the following script will map ranks 0, 1 to GPU 0, ranks 2, 3 to GPU 1, and so on.

#!/bin/bash
# select_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))
exec "$@"

On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <[email protected]> wrote:

>
> On Nov 12, 2025, at 2:31 AM, Grant Chao <[email protected]> wrote:
>
> Thank you for the suggestion.
>
> We have already tried running multiple CPU ranks with a single GPU.
> However, we observed that as the number of ranks increases, the EPS
> solver becomes significantly slower. We are not sure of the exact cause:
> could it be due to process access contention, hidden data transfers, or
> perhaps another reason? We would be very interested to hear your insight
> on this matter.
>
> To avoid this problem, we used the gpu_comm approach mentioned before.
> During testing, we noticed that the mapping between rank ID and GPU ID
> seems to be set automatically and is not user-specifiable.
>
> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds
> ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>
> We tested possible solutions, such as calling cudaSetDevice() manually
> to set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4
> still used GPU 0.
>
> We would appreciate your guidance on how to customize this mapping.
> Thank you for your support.
>
>
>    So you have a single compute "node" connected to multiple GPUs? Then
> the mapping of MPI ranks to GPUs doesn't matter and changing it won't
> improve the performance.
>
> However, we observed that as the number of ranks increases, the EPS
> solver becomes significantly slower.
>
>
>    Does the number of EPS "iterations" increase? Run with one, two, four,
> and eight MPI ranks (and the same number of "GPUs"; if you only have,
> say, four GPUs that is fine, just virtualize them so two different MPI
> ranks share one), add the option -log_view, and send the output. We need
> to know what is slowing down before trying to find any cure.
>
>   Barry
>
>
> Best wishes,
> Grant
>
>
> At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]> wrote:
>
> Hi, Wenbo,
>   I think your approach should work. But before going this extra step
> with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one
> GPU using NVIDIA's Multi-Process Service (MPS)? If MPS works well, then
> you can avoid the extra complexity.
>
> --Junchao Zhang
>
>
> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]>
> wrote:
>
>> Dear all,
>>
>> We are trying to use KSP to solve linear systems on GPUs.
>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which
>> the matrix is created and assembled using the COO interface provided by
>> PETSc. In this example, the number of CPU ranks is the same as the
>> number of GPUs.
>> In our case, the computation of the parameters of the matrix is
>> performed on CPUs, and it is expensive; it may take half of the total
>> time or even more.
>>
>> We want to use more CPU ranks to compute the parameters in parallel,
>> and to create a smaller communicator (gpu_comm) for the CPU ranks
>> attached to the GPUs. The parameters are computed by all of the CPU
>> ranks (in MPI_COMM_WORLD). Then the parameters are sent via MPI to the
>> ranks in gpu_comm. The matrix (of type aijcusparse) is then created and
>> assembled within gpu_comm. Finally, KSPSolve is performed on the GPUs.
>>
>> I'm not sure if this approach will work in practice. Are there any
>> comparable examples I can look to for guidance?
>>
>> Best,
>> Wenbo
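
For reference, a rough, untested sketch of the gpu_comm splitting Wenbo describes above is given below. The choice of ranks 0..3 for gpu_comm, ngpu = 4, the problem size, and the diagonal placeholder assembly are only assumptions for illustration; the MPI transfer of the parameters from the other ranks is omitted, and per-rank device selection would still be handled as discussed above (e.g. via CUDA_VISIBLE_DEVICES).

#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscInt    n    = 128;  /* global problem size; placeholder value */
  PetscInt    ngpu = 4;    /* number of ranks that drive a GPU; placeholder */
  PetscMPIInt rank, color;
  MPI_Comm    gpu_comm;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));

  /* Step 1: every rank in MPI_COMM_WORLD computes its share of the matrix
     parameters here (the expensive CPU-only part in Wenbo's description). */

  /* Step 2: split MPI_COMM_WORLD so that ranks 0..ngpu-1 form gpu_comm and
     the remaining ranks get MPI_COMM_NULL.                                 */
  color = (rank < ngpu) ? 0 : MPI_UNDEFINED;
  PetscCallMPI(MPI_Comm_split(PETSC_COMM_WORLD, color, rank, &gpu_comm));

  /* Step 3: ranks outside gpu_comm send their parameters to a rank inside
     gpu_comm with ordinary MPI_Send/MPI_Recv (omitted in this sketch).     */

  /* Step 4: only the gpu_comm ranks create the GPU matrix and solve.       */
  if (gpu_comm != MPI_COMM_NULL) {
    Mat      A;
    Vec      x, b;
    KSP      ksp;
    PetscInt rstart, rend;

    PetscCall(MatCreate(gpu_comm, &A));
    PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
    PetscCall(MatSetType(A, MATAIJCUSPARSE));
    PetscCall(MatSetUp(A));
    /* Placeholder assembly: a diagonal matrix stands in for the real
       parameters received from the other ranks.                           */
    PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
    for (PetscInt i = rstart; i < rend; i++) PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
    PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
    PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

    PetscCall(MatCreateVecs(A, &x, &b));
    PetscCall(VecSet(b, 1.0));

    PetscCall(KSPCreate(gpu_comm, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetFromOptions(ksp));  /* -ksp_type, -pc_type, -log_view, ... */
    PetscCall(KSPSolve(ksp, b, x));

    PetscCall(KSPDestroy(&ksp));
    PetscCall(VecDestroy(&x));
    PetscCall(VecDestroy(&b));
    PetscCall(MatDestroy(&A));
    PetscCallMPI(MPI_Comm_free(&gpu_comm));
  }
  PetscCall(PetscFinalize());
  return 0;
}

With 8 ranks and 4 GPUs this puts ranks 0-3 in gpu_comm; whether the split should instead be done by node-local rank, and how the parameters are redistributed to match the matrix row distribution on gpu_comm, depends on your application's data layout.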
