Hi, Grant,
  I could reproduce the issue with your code. I think the PETSc code has a problem here, and I created an issue at https://urldefense.us/v3/__https://gitlab.com/petsc/petsc/-/issues/1826__;!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8roDoTvm$ . From your log, it looks like PETSc re-selects a CUDA device when it lazily initializes its device support (which MatSetType triggers), overriding your earlier cudaSetDevice() call. Though we should fix it (I am not sure how yet), a much simpler approach is to use CUDA_VISIBLE_DEVICES. For example, if you just want ranks 0 and 4 to use GPUs 0 and 1 respectively, delete these lines from your example:

if (global_rank == 0) {
  cudaSetDevice(0);
} else if (global_rank == 4) {
  cudaSetDevice(1);
}
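(For reference, the "[Rank N] using GPU M, [line L]" messages in the logs below can be produced by a small helper along these lines; the helper name print_gpu and exact wording are assumptions, since the actual code is in Grant's attachment:)

#include <stdio.h>
#include <cuda_runtime.h>

/* Print the CUDA device this rank currently has selected. */
static void print_gpu(int rank, int line)
{
  int dev = -1;
  cudaGetDevice(&dev); /* logical device id, after any CUDA_VISIBLE_DEVICES remapping */
  printf("[Rank %d] using GPU %d, [line %d].\n", rank, dev, line);
}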
Then, instead, make GPUs 0 and 1 visible to ranks 0 and 4 respectively upfront:

$ cat set_gpu_device
#!/bin/bash
# select_gpu_device wrapper script
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/2)))
exec $*

$ mpirun -n 8 ./set_gpu_device ./ex0
[Rank 5] no computation assigned.
[Rank 6] no computation assigned.
[Rank 7] no computation assigned.
[Rank 0] using GPU 0, [line 23].
[Rank 0] using GPU 0, [line 32] after setdevice.
[Rank 1] no computation assigned.
[Rank 2] no computation assigned.
[Rank 3] no computation assigned.
[Rank 4] using GPU 0, [line 23].
[Rank 4] using GPU 0, [line 32] after setdevice.
[Rank 0] using GPU 0, [line 42] after create A.
[Rank 4] using GPU 0, [line 42] after create A.
[Rank 4] using GPU 0, [line 46] after set A type.
[Rank 0] using GPU 0, [line 46] after set A type.
[Rank 0] using GPU 0, [line 50] after MatSetUp.
[Rank 4] using GPU 0, [line 50] after MatSetUp.
[Rank 0] using GPU 0, [line 63] after Mat Assemble.
[Rank 4] using GPU 0, [line 63] after Mat Assemble.
Smallest eigenvalue = 100.000000
Smallest eigenvalue = 100.000000

Note that for rank 4, GPU 0 is actually the physical GPU 1: CUDA_VISIBLE_DEVICES renumbers the devices each process can see, so each rank sees its single visible GPU as device 0.
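If you want to double-check which physical GPU each rank actually got, the PCI bus id is stable across CUDA_VISIBLE_DEVICES renumbering; a minimal sketch (print_physical_gpu is not from your test code):

#include <stdio.h>
#include <cuda_runtime.h>

/* Print the PCI bus id of this rank's current device; the bus id
   identifies the physical GPU regardless of logical renumbering. */
static void print_physical_gpu(int rank)
{
  int  dev = -1;
  char bus[32];
  cudaGetDevice(&dev);
  cudaDeviceGetPCIBusId(bus, (int)sizeof(bus), dev);
  printf("[Rank %d] logical GPU %d is PCI %s.\n", rank, dev, bus);
}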
Let me know if it works.

--Junchao Zhang

On Thu, Nov 13, 2025 at 11:17 AM Grant Chao <[email protected]> wrote:
> Junchao,
> We have tried cudaSetDevice. The test code is attached; 8 CPU ranks and 2 GPUs are used. We create a gpu_comm containing ranks 0 and 4, then assign GPU 0 to rank 0 and GPU 1 to rank 4. After MatSetType, rank 4 is mapped to GPU 0 again.
>
> The run command is
> mpirun -n 8 ./a.out -eps_type jd -st_ksp_type gmres -st_pc_type none
>
> The stdout is shown below:
> [Rank 0] using GPU 0, [line 22].
> [Rank 1] no computation assigned.
> [Rank 2] no computation assigned.
> [Rank 3] no computation assigned.
> [Rank 4] using GPU 0, [line 22].
> [Rank 5] no computation assigned.
> [Rank 6] no computation assigned.
> [Rank 7] no computation assigned.
> [Rank 4] using GPU 1, [line 31] after setdevice.    -------- here the device is set successfully
> [Rank 0] using GPU 0, [line 31] after setdevice.
> [Rank 4] using GPU 1, [line 41] after create A.
> [Rank 0] using GPU 0, [line 41] after create A.
> [Rank 0] using GPU 0, [line 45] after set A type.
> [Rank 4] using GPU 0, [line 45] after set A type.    ------ changed to 0?
> [Rank 4] using GPU 0, [line 49] after MatSetUp.
> [Rank 0] using GPU 0, [line 49] after MatSetUp.
> [Rank 4] using GPU 0, [line 62] after Mat Assemble.
> [Rank 0] using GPU 0, [line 62] after Mat Assemble.
> Smallest eigenvalue = 100.000000
> Smallest eigenvalue = 100.000000
>
> BEST,
> Grant
>
> At 2025-11-13 05:58:05, "Junchao Zhang" <[email protected]> wrote:
> A common approach is to use CUDA_VISIBLE_DEVICES to control the mapping of MPI ranks to GPUs; see the GPU-nodes section at https://urldefense.us/v3/__https://docs.nersc.gov/jobs/affinity/*gpu-nodes__;Iw!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8pWxGvch$
>
> With OpenMPI, you can use OMPI_COMM_WORLD_LOCAL_RANK in place of SLURM_LOCALID (see https://urldefense.us/v3/__https://docs.open-mpi.org/en/v5.0.x/tuning-apps/environment-var.html__;!!G_uCfscf7eWS!ZSsk7IMQF7yL-THgMdfh_H3K7F1HUJg38n2dhkaBkJR1IvhSOpfX3c1TZLEL6JDNyCGACV-PEFWtIy-WgsKA8khuXtvj$ ).
> For example, with 8 MPI ranks and 4 GPUs per node, the following script maps ranks 0, 1 to GPU 0, ranks 2, 3 to GPU 1, ranks 4, 5 to GPU 2, and ranks 6, 7 to GPU 3:
>
> #!/bin/bash
> # select_gpu_device wrapper script
> export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK/(OMPI_COMM_WORLD_LOCAL_SIZE/4)))
> exec $*
>
> On Wed, Nov 12, 2025 at 10:20 AM Barry Smith <[email protected]> wrote:
>> On Nov 12, 2025, at 2:31 AM, Grant Chao <[email protected]> wrote:
>>
>> Thank you for the suggestion.
>>
>> We have already tried running multiple CPU ranks with a single GPU. However, we observed that as the number of ranks increases, the EPS solver becomes significantly slower. We are not sure of the exact cause: could it be process access contention, hidden data transfers, or perhaps another reason? We would be very interested to hear your insight on this matter.
>>
>> To avoid this problem, we used the gpu_comm approach mentioned before. During testing, we noticed that the mapping between rank ID and GPU ID seems to be set automatically and is not user-specifiable.
>>
>> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>>
>> We tested possible workarounds, such as calling cudaSetDevice() manually to map rank 4 to device 1, but it did not work as expected: ranks 0 and 4 still used GPU 0.
>>
>> We would appreciate your guidance on how to customize this mapping. Thank you for your support.
>>
>> So you have a single compute "node" connected to multiple GPUs? Then the mapping of MPI ranks to GPUs doesn't matter, and changing it won't improve the performance.
>>
>> However, we observed that as the number of ranks increases, the EPS solver becomes significantly slower.
>>
>> Does the number of EPS "iterations" increase? Run with one, two, four, and eight MPI ranks (and the same number of "GPUs"; if you only have, say, four GPUs, that is fine, just virtualize them so two different MPI ranks share one) and the option -log_view, and send the output. We need to know what is slowing down before trying to find any cure.
>>
>> Barry
>>
>> Best wishes,
>> Grant
>>
>> At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]> wrote:
>> Hi, Wenbo,
>>   I think your approach should work. But before going this extra step with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU using NVIDIA's Multi-Process Service (MPS)? If MPS works well, you can avoid the extra complexity.
>>
>> --Junchao Zhang
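(For reference, the gpu_comm discussed in this thread can be created with MPI_Comm_split; a minimal sketch with assumed names, not code from any of the attachments:)

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch: put ranks 0 and 4 of MPI_COMM_WORLD into gpu_comm; all other
   ranks pass MPI_UNDEFINED as color and get MPI_COMM_NULL back. */
static MPI_Comm CreateGpuComm(void)
{
  MPI_Comm gpu_comm = MPI_COMM_NULL;
  int      global_rank;

  MPI_Comm_rank(MPI_COMM_WORLD, &global_rank);
  int color = (global_rank == 0 || global_rank == 4) ? 0 : MPI_UNDEFINED;
  MPI_Comm_split(MPI_COMM_WORLD, color, global_rank, &gpu_comm);
  /* With the CUDA_VISIBLE_DEVICES wrapper suggested earlier in the thread,
     this explicit cudaSetDevice() call becomes unnecessary. */
  if (gpu_comm != MPI_COMM_NULL) cudaSetDevice(global_rank == 0 ? 0 : 1);
  return gpu_comm;
}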
>> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:
>>> Dear all,
>>>
>>> We are trying to solve a KSP system on GPUs. We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the matrix is created and assembled using the COO interface provided by PETSc. In that example, the number of CPU ranks is the same as the number of GPUs.
>>> In our case, the computation of the matrix coefficients is performed on CPUs, and it is expensive: it can take half of the total time or even more.
>>>
>>> We want to use more CPUs to compute the coefficients in parallel, and to create a smaller communication domain (say gpu_comm) for the CPU ranks corresponding to the GPUs. The coefficients are computed by all ranks in MPI_COMM_WORLD, then sent via MPI to the ranks in gpu_comm. The matrix (of type aijcusparse) is then created and assembled within gpu_comm. Finally, KSPSolve is performed on the GPUs.
>>>
>>> I'm not sure if this approach will work in practice. Are there any comparable examples I can look to for guidance?
>>>
>>> Best,
>>> Wenbo
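(As a rough sketch of the kind of example Wenbo asks about, assembling an aijcusparse matrix with the COO interface on a subcommunicator and solving there; this is not an existing PETSc tutorial, and all names are assumptions:)

#include <petscksp.h>

/* Sketch: ranks outside gpu_comm skip; ranks inside assemble an
   AIJCUSPARSE matrix from COO triplets (received earlier via MPI
   from the compute-only ranks) and run KSPSolve on the GPU. */
static PetscErrorCode SolveOnGpuComm(MPI_Comm gpu_comm, PetscInt nlocal, PetscCount ncoo,
                                     PetscInt *coo_i, PetscInt *coo_j, PetscScalar *coo_v)
{
  Mat A;
  KSP ksp;
  Vec x, b;

  PetscFunctionBeginUser;
  if (gpu_comm == MPI_COMM_NULL) PetscFunctionReturn(PETSC_SUCCESS); /* non-GPU ranks skip */
  PetscCall(MatCreate(gpu_comm, &A));
  PetscCall(MatSetSizes(A, nlocal, nlocal, PETSC_DECIDE, PETSC_DECIDE));
  PetscCall(MatSetType(A, MATAIJCUSPARSE)); /* GPU matrix type */
  PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j));
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES)); /* assembly runs on the GPU */
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));
  PetscCall(KSPCreate(gpu_comm, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscFunctionReturn(PETSC_SUCCESS);
}

bench_kspsolve.c shows the COO calls in context; the only new ingredient here is passing gpu_comm instead of PETSC_COMM_WORLD to MatCreate() and KSPCreate().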
