Junchao,
We have tried cudaSetDevice.
The test code is attached. We use 8 CPU ranks and 2 GPUs, and we create a
gpu_comm containing rank 0 and rank 4.
We then assign GPU 0 to rank 0 and GPU 1 to rank 4.
After MatSetType, rank 4 is mapped to GPU 0 again.
The run command is
mpirun -n 8 ./a.out -eps_type jd -st_ksp_type gmres -st_pc_type none
The stdout is shown below:
[Rank 0] using GPU 0, [line 22].
[Rank 1] no computation assigned.
[Rank 2] no computation assigned.
[Rank 3] no computation assigned.
[Rank 4] using GPU 0, [line 22].
[Rank 5] no computation assigned.
[Rank 6] no computation assigned.
[Rank 7] no computation assigned.
[Rank 4] using GPU 1, [line 31] after setdevice. -------- cudaSetDevice succeeded here
[Rank 0] using GPU 0, [line 31] after setdevice.
[Rank 4] using GPU 1, [line 41] after create A.
[Rank 0] using GPU 0, [line 41] after create A.
[Rank 0] using GPU 0, [line 45] after set A type.
[Rank 4] using GPU 0, [line 45] after set A type. ------ changed back to GPU 0?
[Rank 4] using GPU 0, [line 49] after MatSetUp.
[Rank 0] using GPU 0, [line 49] after MatSetUp.
[Rank 4] using GPU 0, [line 62] after Mat Assemble.
[Rank 0] using GPU 0, [line 62] after Mat Assemble.
Smallest eigenvalue = 100.000000
Smallest eigenvalue = 100.000000
Best,
Grant
At 2025-11-13 05:58:05, "Junchao Zhang" <[email protected]> wrote:
A common approach is to use CUDA_VISIBLE_DEVICES to control the mapping of MPI
ranks to GPUs; see the section at
https://docs.nersc.gov/jobs/affinity/#gpu-nodes
With OpenMPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of SLURM_LOCALID
(see
https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html
). For example, with 8 MPI ranks and 4 GPUs per node, the following script
maps ranks 0 and 1 to GPU 0, ranks 2 and 3 to GPU 1, and so on.
#!/bin/bash
# select_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))
exec "$@"
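The arithmetic in the wrapper script can be checked in isolation. The following is a minimal sketch in plain C, not part of the original message: the function names block_gpu_for_rank and print_block_mapping are hypothetical, and it assumes the number of ranks per node is a positive multiple of the number of GPUs per node (otherwise the integer division in the script would also misbehave).

```c
#include <assert.h>
#include <stdio.h>

/* Block mapping, same arithmetic as the wrapper script:
 *   gpu = local_rank / (local_size / gpus_per_node)
 * Assumption: local_size is a positive multiple of gpus_per_node. */
int block_gpu_for_rank(int local_rank, int local_size, int gpus_per_node)
{
    assert(gpus_per_node > 0 && local_size >= gpus_per_node);
    assert(local_size % gpus_per_node == 0);
    assert(local_rank >= 0 && local_rank < local_size);

    int ranks_per_gpu = local_size / gpus_per_node;
    return local_rank / ranks_per_gpu;
}

/* Print the rank-to-GPU table for one node. */
void print_block_mapping(int local_size, int gpus_per_node)
{
    for (int r = 0; r < local_size; r++) {
        printf("local rank %d -> GPU %d\n", r,
               block_gpu_for_rank(r, local_size, gpus_per_node));
    }
}
```

Calling print_block_mapping(8, 4) lists consecutive pairs of local ranks on the same GPU, which is the mapping described for 8 ranks and 4 GPUs per node.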
On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <[email protected]> wrote:
On Nov 12, 2025, at 2:31 AM, Grant Chao <[email protected]> wrote:
Thank you for the suggestion.
We have already tried running multiple CPU ranks with a single GPU. However, we
observed that as the number of ranks increases, the EPS solver becomes
significantly slower. We are not sure of the exact cause—could it be due to
process access contention, hidden data transfers, or perhaps another reason? We
would be very interested to hear your insight on this matter.
To avoid this problem, we used the gpu_comm approach mentioned before. During
testing, we noticed that the mapping between rank ID and GPU ID seems to be set
automatically and is not user-specifiable.
For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds ranks 0
and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
We tested possible solutions, such as calling cudaSetDevice() manually to set
rank 4 to device 1, but it did not work as expected. Ranks 0 and 4 still used
GPU 0.
We would appreciate your guidance on how to customize this mapping. Thank you
for your support.
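The binding described above (ranks 0 and 4 on GPU 0, ranks 1 and 5 on GPU 1, with 4 GPUs) is what a round-robin rule, rank modulo the number of GPUs, would produce. The sketch below only models that observed behavior; the function name round_robin_gpu_for_rank is hypothetical, and nothing here is taken from the PETSc sources.

```c
#include <assert.h>

/* Round-robin model of the observed default binding:
 *   gpu = rank % num_gpus
 * Assumption: this reproduces the behavior reported in the thread;
 * it is not a statement about how the library is implemented. */
int round_robin_gpu_for_rank(int rank, int num_gpus)
{
    assert(num_gpus > 0);
    assert(rank >= 0);
    return rank % num_gpus;
}
```

With 2 GPUs the same rule gives round_robin_gpu_for_rank(4, 2) == 0, so rank 4 would land on GPU 0 whenever the default selection is applied after a manual cudaSetDevice call.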
So you have a single compute "node" connected to multiple GPUs? Then the
mapping of MPI ranks to GPUs doesn't matter and changing it won't improve the
performance.
However, we observed that as the number of ranks increases, the EPS solver
becomes significantly slower.
Does the number of EPS "iterations" increase? Run with one, two, four, and
eight MPI ranks and the same number of "GPUs" (if you only have, say, four GPUs
that is fine; just virtualize them so that two different MPI ranks share one),
add the option -log_view, and send the output. We need to know what is slowing
things down before trying to find any cure.
Barry
Best wishes,
Grant
At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]>, said:
Hi, Wenbo,
I think your approach should work. But before going this extra step with
gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU, using
NVIDIA's Multi-Process Service (MPS)? If MPS works well, then you can
avoid the extra complexity.
--Junchao Zhang
On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:
Dear all,
We are trying to solve linear systems with KSP on GPUs.
We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the
matrix is created and assembled using the COO interface provided by PETSc. In
this example, the number of CPU ranks is the same as the number of GPUs.
In our case, the matrix parameters are computed on the CPUs, and this
computation is expensive: it can take half of the total time or even more.
We want to use more CPU ranks to compute the parameters in parallel, so we
create a smaller communicator (such as gpu_comm) that contains only the ranks
associated with the GPUs. The parameters are computed by all ranks (in
MPI_COMM_WORLD) and then sent via MPI to the ranks in gpu_comm. The matrix (of
type aijcusparse) is then created and assembled within gpu_comm. Finally,
KSPSolve is performed on the GPUs.
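The selection of ranks for gpu_comm can be sketched as plain integer arithmetic. This is an illustration only, under the assumption that the world size is a positive multiple of the number of GPUs and that the first rank of each group drives one GPU. The names NOT_IN_GPU_COMM, ranks_per_gpu, gpu_comm_color, and intended_gpu are hypothetical; in an MPI program the sentinel would be MPI_UNDEFINED passed as the color to MPI_Comm_split.

```c
#include <assert.h>

/* Hypothetical sentinel for "not a member of gpu_comm"; an MPI program
 * would pass MPI_UNDEFINED as the split color instead. */
#define NOT_IN_GPU_COMM (-1)

/* Number of world ranks that feed each GPU.
 * Assumption: world_size is a positive multiple of num_gpus. */
int ranks_per_gpu(int world_size, int num_gpus)
{
    assert(num_gpus > 0 && world_size >= num_gpus);
    assert(world_size % num_gpus == 0);
    return world_size / num_gpus;
}

/* Split color: 1 for the first rank of each group, sentinel otherwise. */
int gpu_comm_color(int world_rank, int world_size, int num_gpus)
{
    int stride = ranks_per_gpu(world_size, num_gpus);
    return (world_rank % stride == 0) ? 1 : NOT_IN_GPU_COMM;
}

/* GPU index that a gpu_comm member is intended to drive. */
int intended_gpu(int world_rank, int world_size, int num_gpus)
{
    return world_rank / ranks_per_gpu(world_size, num_gpus);
}
```

For 8 ranks and 2 GPUs this selects world ranks 0 and 4, with intended GPUs 0 and 1 respectively. Whether the library then honors that intended device is a separate question.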
I’m not sure if this approach will work in practice. Are there any comparable
examples I can look to for guidance?
Best,
Wenbo
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <slepceps.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  SlepcInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm global_comm = PETSC_COMM_WORLD;
  MPI_Comm sub_comm;
  int global_rank, global_size;
  MPI_Comm_rank(global_comm, &global_rank);
  MPI_Comm_size(global_comm, &global_size);

  // Create sub-communicator for ranks 0 and 4; all other ranks get MPI_COMM_NULL
  int color = (global_rank == 0 || global_rank == 4) ? 1 : MPI_UNDEFINED;
  MPI_Comm_split(global_comm, color, global_rank, &sub_comm);
  int dev;

  // Only ranks in sub-communicator work on EPS problem
  if (color == 1) {
    // Report the device that is current before any manual selection
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d].\n", global_rank, dev, __LINE__);

    // Manually select the device: rank 0 -> GPU 0, rank 4 -> GPU 1
    if (global_rank == 0) {
      cudaSetDevice(0);
    } else if (global_rank == 4) {
      cudaSetDevice(1);
    }
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d] after setdevice.\n", global_rank, dev, __LINE__);

    EPS eps;
    Mat A;
    PetscInt n = 100; // Small matrix for simplicity
    PetscInt Istart, Iend, i;

    // Create and setup matrix
    MatCreate(sub_comm, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d] after create A.\n", global_rank, dev, __LINE__);

    // In the reported output, the current device on rank 4 changes after this call
    MatSetType(A, MATAIJCUSPARSE);
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d] after set A type.\n", global_rank, dev, __LINE__);

    MatSetUp(A);
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d] after MatSetUp.\n", global_rank, dev, __LINE__);
    MatGetOwnershipRange(A, &Istart, &Iend);

    // Set matrix entries (simple diagonal matrix)
    for (i = Istart; i < Iend; i++) {
      PetscScalar v = i + 1.0; // Eigenvalues are 1, 2, 3, ..., n
      MatSetValue(A, i, i, v, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    cudaGetDevice(&dev);
    printf("[Rank %d] using GPU %d, [line %d] after Mat Assemble.\n", global_rank, dev, __LINE__);

    // Create and solve EPS
    EPSCreate(sub_comm, &eps);
    EPSSetOperators(eps, A, NULL);
    EPSSetProblemType(eps, EPS_HEP);
    EPSSetFromOptions(eps);
    EPSSolve(eps);

    // Print the first converged eigenvalue. The label says "smallest", but
    // by default EPS targets the largest-magnitude eigenvalue (100 here).
    PetscInt nconv;
    EPSGetConverged(eps, &nconv);
    if (nconv > 0) {
      PetscScalar kr, ki; // EPSGetEigenvalue takes PetscScalar pointers
      EPSGetEigenvalue(eps, 0, &kr, &ki);
      printf("Smallest eigenvalue = %f\n", (double)PetscRealPart(kr));
    }

    // Cleanup
    MatDestroy(&A);
    EPSDestroy(&eps);
  } else {
    printf("[Rank %d] no computation assigned.\n", global_rank);
  }
  MPI_Barrier(global_comm);

  // Free sub-communicator if created
  if (sub_comm != MPI_COMM_NULL) {
    MPI_Comm_free(&sub_comm);
  }
  SlepcFinalize();
  return 0;
}