Hi Wenbo, I think your approach should work. But before taking this extra step with gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU using NVIDIA's Multi-Process Service (MPS)? If MPS works well, you can avoid the extra complexity.
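
To make the comparison concrete, here is a rough, untested sketch of that simpler path: every rank keeps its own rows and assembles the AIJCUSPARSE matrix directly via the COO interface, much like bench_kspsolve.c, and MPS (started outside the program, e.g. with nvidia-cuda-mps-control) takes care of several ranks sharing one GPU. The 1D Laplacian and the sizes below are made up purely for illustration:

#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  PetscMPIInt rank, size;
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
  PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &size));

  const PetscInt nloc = 1000;        /* toy local size, made up */
  const PetscInt N    = nloc * size; /* global size */
  const PetscInt r0   = rank * nloc; /* first global row owned by this rank */

  /* Each rank computes COO triplets for its rows of a 1D Laplacian on the CPU;
     in your application this would be the expensive coefficient computation. */
  PetscInt     n = 0, *ci, *cj;
  PetscScalar *cv;
  PetscCall(PetscMalloc3(3 * nloc, &ci, 3 * nloc, &cj, 3 * nloc, &cv));
  for (PetscInt i = r0; i < r0 + nloc; i++) {
    ci[n] = i; cj[n] = i; cv[n++] = 2.0;
    if (i > 0)     { ci[n] = i; cj[n] = i - 1; cv[n++] = -1.0; }
    if (i < N - 1) { ci[n] = i; cj[n] = i + 1; cv[n++] = -1.0; }
  }

  /* Every rank assembles its block of the GPU matrix directly via COO and solves */
  Mat A;
  Vec x, b;
  KSP ksp;
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, nloc, nloc, N, N));
  PetscCall(MatSetType(A, MATAIJCUSPARSE));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetPreallocationCOO(A, n, ci, cj));
  PetscCall(MatSetValuesCOO(A, cv, INSERT_VALUES));
  PetscCall(MatCreateVecs(A, &x, &b));
  PetscCall(VecSet(b, 1.0));
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetFromOptions(ksp));
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&b));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFree3(ci, cj, cv));
  PetscCall(PetscFinalize());
  return 0;
}

If I remember correctly, PETSc assigns devices to ranks round-robin within a node by default, so with MPS running you would simply launch more ranks than GPUs and let the ranks share the devices.
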
--Junchao Zhang

On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:
> Dear all,
>
> We are trying to solve a linear system with KSP on GPUs.
> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the
> matrix is created and assembled using the COO interface provided by PETSc.
> In this example, the number of CPUs is the same as the number of GPUs.
> In our case, the matrix coefficients are computed on the CPUs, and this is
> expensive: it can take half of the total time or even more.
>
> We want to use more CPUs to compute these coefficients in parallel, and to
> create a smaller communicator (say, gpu_comm) for the CPUs attached to the
> GPUs. The coefficients are computed by all of the CPUs (in MPI_COMM_WORLD)
> and then sent to the gpu_comm ranks via MPI. The matrix (of type
> aijcusparse) is then created and assembled within gpu_comm. Finally,
> KSPSolve is performed on the GPUs.
>
> I'm not sure if this approach will work in practice. Are there any
> comparable examples I can look to for guidance?
>
> Best,
> Wenbo
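
For reference, a rough and untested sketch of the gpu_comm idea described above could look like the following. It assumes 4 CPU ranks feed each GPU, uses a toy 1D Laplacian for the coefficient computation, and gathers the COO triplets onto the team roots with MPI_Gatherv; ranks_per_gpu, the problem size, and the gather layout are placeholders, not taken from an existing PETSc example:

#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  PetscMPIInt rank, size;
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
  PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &size));

  const PetscMPIInt ranks_per_gpu = 4;    /* assumption: 4 CPU ranks per GPU */
  const PetscInt    rows_per_rank = 1000; /* toy problem size, made up */
  const PetscInt    N             = rows_per_rank * size;
  const PetscInt    row0          = rank * rows_per_rank;

  /* Team = the CPU ranks that feed one GPU; team rank 0 owns that GPU */
  MPI_Comm    team;
  PetscMPIInt trank, tsize;
  PetscCallMPI(MPI_Comm_split(PETSC_COMM_WORLD, rank / ranks_per_gpu, rank, &team));
  PetscCallMPI(MPI_Comm_rank(team, &trank));
  PetscCallMPI(MPI_Comm_size(team, &tsize));

  /* gpu_comm = the team roots only; the other ranks get MPI_COMM_NULL */
  MPI_Comm gpu_comm;
  PetscCallMPI(MPI_Comm_split(PETSC_COMM_WORLD, trank == 0 ? 0 : MPI_UNDEFINED, rank, &gpu_comm));

  /* 1) Every rank computes COO triplets of a 1D Laplacian for its rows (CPU work) */
  PetscInt     nmax = 3 * rows_per_rank, n = 0, *ci, *cj;
  PetscScalar *cv;
  PetscCall(PetscMalloc3(nmax, &ci, nmax, &cj, nmax, &cv));
  for (PetscInt i = row0; i < row0 + rows_per_rank; i++) {
    ci[n] = i; cj[n] = i; cv[n++] = 2.0;
    if (i > 0)     { ci[n] = i; cj[n] = i - 1; cv[n++] = -1.0; }
    if (i < N - 1) { ci[n] = i; cj[n] = i + 1; cv[n++] = -1.0; }
  }

  /* 2) Gather the triplets from the whole team onto the team root with plain MPI */
  PetscMPIInt  nm = (PetscMPIInt)n, *counts = NULL, *displs = NULL;
  PetscInt    *gi = NULL, *gj = NULL, ntot = 0;
  PetscScalar *gv = NULL;
  if (trank == 0) PetscCall(PetscMalloc2(tsize, &counts, tsize, &displs));
  PetscCallMPI(MPI_Gather(&nm, 1, MPI_INT, counts, 1, MPI_INT, 0, team));
  if (trank == 0) {
    for (PetscMPIInt r = 0; r < tsize; r++) { displs[r] = (PetscMPIInt)ntot; ntot += counts[r]; }
    PetscCall(PetscMalloc3(ntot, &gi, ntot, &gj, ntot, &gv));
  }
  PetscCallMPI(MPI_Gatherv(ci, nm, MPIU_INT, gi, counts, displs, MPIU_INT, 0, team));
  PetscCallMPI(MPI_Gatherv(cj, nm, MPIU_INT, gj, counts, displs, MPIU_INT, 0, team));
  PetscCallMPI(MPI_Gatherv(cv, nm, MPIU_SCALAR, gv, counts, displs, MPIU_SCALAR, 0, team));

  /* 3) Team roots assemble the AIJCUSPARSE matrix and solve on gpu_comm */
  if (trank == 0) {
    Mat A;
    Vec x, b;
    KSP ksp;
    PetscCall(MatCreate(gpu_comm, &A));
    PetscCall(MatSetSizes(A, tsize * rows_per_rank, tsize * rows_per_rank, N, N));
    PetscCall(MatSetType(A, MATAIJCUSPARSE));
    PetscCall(MatSetFromOptions(A));
    PetscCall(MatSetPreallocationCOO(A, ntot, gi, gj));
    PetscCall(MatSetValuesCOO(A, gv, INSERT_VALUES));
    PetscCall(MatCreateVecs(A, &x, &b));
    PetscCall(VecSet(b, 1.0));
    PetscCall(KSPCreate(gpu_comm, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetFromOptions(ksp));
    PetscCall(KSPSolve(ksp, b, x));
    PetscCall(KSPDestroy(&ksp));
    PetscCall(VecDestroy(&x));
    PetscCall(VecDestroy(&b));
    PetscCall(MatDestroy(&A));
    PetscCall(PetscFree3(gi, gj, gv));
    PetscCall(PetscFree2(counts, displs));
    PetscCallMPI(MPI_Comm_free(&gpu_comm));
  }
  PetscCall(PetscFree3(ci, cj, cv));
  PetscCallMPI(MPI_Comm_free(&team));
  PetscCall(PetscFinalize());
  return 0;
}

The two points this sketch tries to show are that only the team roots pass a defined color to the second MPI_Comm_split, so gpu_comm exists only on the ranks that drive a GPU, and that MatSetPreallocationCOO / MatSetValuesCOO and KSPSolve are called only on gpu_comm while the coefficient computation runs on all of MPI_COMM_WORLD.
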
