Re: [petsc-users] gpu cpu parallel

Barry Smith Wed, 12 Nov 2025 08:21:12 -0800


> On Nov 12, 2025, at 2:31 AM, Grant Chao <[email protected]> wrote:
> 
> 
> Thank you for the suggestion.
> 
> We have already tried running multiple CPU ranks with a single GPU. However, 
> we observed that as the number of ranks increases, the EPS solver becomes 
> significantly slower. We are not sure of the exact cause—could it be due to 
> process access contention, hidden data transfers, or perhaps another reason? 
> We would be very interested to hear your insight on this matter.
> 
> To avoid this problem, we used the gpu_comm approach mentioned before. During 
> testing, we noticed that the mapping between rank ID and GPU ID seems to be 
> set automatically and is not user-specifiable.
> 
> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds ranks 
> 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
> 
> We tested possible solutions, such as calling cudaSetDevice() manually to set 
> rank 4 to device 1, but it did not work as expected. Ranks 0 and 4 still used 
> GPU 0.
> 
> We would appreciate your guidance on how to customize this mapping. Thank you 
> for your support.

  So you have a single compute "node" connected to multiple GPUs?  Then the 
mapping of MPI ranks to GPUs doesn't matter and changing it won't improve the 
performance.
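
For illustration, the rank-to-GPU assignment reported above (ranks 0 and 4 on 
GPU 0, ranks 1 and 5 on GPU 1) matches a round-robin rule: rank modulo the 
number of GPUs. The sketch below only models that observed behaviour and is 
not PETSc's device-selection code. The function names gpu_round_robin and 
gpu_block are hypothetical; gpu_block shows the alternative layout in which 
consecutive ranks share a device.

```c
#include <assert.h>

/* Round-robin layout: rank r uses device r % ngpus.
   This reproduces the mapping reported in the thread (an assumption
   about the observed behaviour, not PETSc source). */
int gpu_round_robin(int rank, int ngpus)
{
  assert(rank >= 0 && ngpus > 0);
  return rank % ngpus;
}

/* Block layout: consecutive ranks share a device.
   Assumes nranks is a multiple of ngpus. */
int gpu_block(int rank, int nranks, int ngpus)
{
  assert(ngpus > 0 && nranks % ngpus == 0);
  assert(rank >= 0 && rank < nranks);
  return rank / (nranks / ngpus);
}
```

With 8 ranks and 4 GPUs, rank 4 lands on GPU 0 under the round-robin rule and 
on GPU 2 under the block rule; in both layouts each GPU serves two ranks.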

> However, we observed that as the number of ranks increases, the EPS solver 
> becomes significantly slower.

  Does the number of EPS "iterations" increase? Run with one, two, four, and 
eight MPI ranks and the same number of "GPUs" (if you only have, say, four 
GPUs, that is fine; just virtualize them so two different MPI ranks share 
one), add the option -log_view, and send the output. We need to know what is 
slowing down before trying to find any cure.

  Barry




> 
> Best wishes,
> Grant
> 
> 
> At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]>, said:
> Hi, Wenbo,
>    I think your approach should work. But before going this extra step with 
> gpu_comm, have you tried to map multiple MPI ranks (CPUs) to one GPU, using 
> NVIDIA's Multi-Process Service (MPS)? If MPS works well, then you can 
> avoid the extra complexity. 
> 
> --Junchao Zhang
> 
> 
> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:
>> Dear all,
>> 
>> We are trying to run KSP solves on GPUs.
>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the 
>> matrix is created and assembled using the COO interface provided by PETSc. 
>> In this example, the number of CPUs is the same as the number of GPUs.
>> In our case, the matrix parameters are computed on CPUs, and this is 
>> expensive: it can take half of the total time or even more. 
>> 
>> We want to use more CPUs to compute the parameters in parallel, so we 
>> create a smaller communicator (such as gpu_comm) for the CPUs that 
>> correspond to the GPUs. The parameters are computed by all of the CPUs (in 
>> MPI_COMM_WORLD) and are then sent via MPI to the CPUs in gpu_comm. The 
>> matrix (of type aijcusparse) is then created and assembled within 
>> gpu_comm. Finally, ksp_solve is performed on the GPUs.
>> 
>> I’m not sure if this approach will work in practice. Are there any 
>> comparable examples I can look to for guidance?
>> 
>> Best,
>> Wenbo
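
The communicator layout described in the original question can be sketched as 
plain rank arithmetic. This is a minimal sketch under assumptions not stated 
in the thread: ranks are grouped in consecutive blocks, one block per GPU; the 
first rank of each block joins gpu_comm; and the rank count is a multiple of 
the GPU count. The helper names in_gpu_comm and owner_rank are hypothetical.

```c
#include <assert.h>

/* Number of MPI_COMM_WORLD ranks that feed one GPU.
   Assumes nranks is a positive multiple of ngpus. */
int ranks_per_gpu(int nranks, int ngpus)
{
  assert(ngpus > 0 && nranks > 0 && nranks % ngpus == 0);
  return nranks / ngpus;
}

/* Returns 1 if this world rank would be a member of gpu_comm
   (the first rank of its block), 0 otherwise. */
int in_gpu_comm(int rank, int nranks, int ngpus)
{
  assert(rank >= 0 && rank < nranks);
  return rank % ranks_per_gpu(nranks, ngpus) == 0;
}

/* World rank of the gpu_comm member that should receive the
   parameters computed by this rank. */
int owner_rank(int rank, int nranks, int ngpus)
{
  assert(rank >= 0 && rank < nranks);
  int per = ranks_per_gpu(nranks, ngpus);
  return (rank / per) * per;
}
```

In an MPI program, the membership flag would typically determine the color 
passed to MPI_Comm_split (with MPI_UNDEFINED for non-members), and owner_rank 
would give the destination for each rank's computed parameters. With 8 ranks 
and 4 GPUs, ranks 0, 2, 4, and 6 form gpu_comm, and rank 3 sends to rank 2.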
