Hello, David,
   It took longer than I expected to add the CUDA-aware MPI feature to PETSc. It 
is now in PETSc 3.12, released last week. I have a small fix after that release, 
so you had better use the PETSc master branch. Use the PETSc option 
-use_gpu_aware_mpi to enable it. On Summit you also need jsrun --smpiargs="-gpu" 
to enable IBM Spectrum MPI's CUDA support, and if you run with multiple MPI 
ranks per GPU you also need #BSUB -alloc_flags gpumps in your job script.
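  For reference, a minimal Summit job script along these lines should work (the 
project ID, walltime, node count, executable name, and resource-set layout are 
placeholders to adapt to your case):

  #!/bin/bash
  #BSUB -P <project>
  #BSUB -W 0:30
  #BSUB -nnodes 1
  # gpumps is only needed when several MPI ranks share one GPU
  #BSUB -alloc_flags gpumps

  # Six resource sets per node: 1 MPI rank, 7 cores, 1 GPU each.
  # --smpiargs="-gpu" turns on Spectrum MPI's CUDA support;
  # -use_gpu_aware_mpi tells PETSc to use it.
  jsrun -n 6 -a 1 -c 7 -g 1 --smpiargs="-gpu" ./your_app -use_gpu_aware_mpi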
  My experiments on Summit (using a simple test doing repeated MatMult) are 
mixed. With one MPI rank per GPU, I saw a very good performance improvement (up 
to 25%), but with multiple ranks per GPU I did not see any improvement. That 
seems counterintuitive, since it should be easier for MPI ranks to communicate 
data residing on the same GPU. I am investigating this issue.
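  The test was essentially a loop of MatMult calls on a matrix loaded from a 
binary file. A minimal sketch of that kind of driver is below (not the exact 
code I ran; the file name "A.dat" and the iteration count are placeholders, and 
error checking is abbreviated). Run it with -mat_type aijcusparse so the matrix 
and its work vectors live on the GPU:

  #include <petscmat.h>

  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, y;
    PetscViewer    viewer;
    PetscInt       i, niter = 100;
    PetscErrorCode ierr;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

    /* Load the matrix; its type (e.g. aijcusparse) is taken from the options */
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.dat", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);
    ierr = MatLoad(A, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

    /* Work vectors matching the matrix type (CUDA vectors for aijcusparse) */
    ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);

    /* Repeated MatMult; -log_view shows the copy and communication costs */
    for (i = 0; i < niter; i++) {
      ierr = MatMult(A, x, y);CHKERRQ(ierr);
    }

    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }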
  If you can also evaluate this feature with your production code, that would 
be helpful.
  Thanks.
--Junchao Zhang


On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller 
<david.gutzwil...@gmail.com> wrote:
Hello Junchao,

Spectacular news!

I have our production code running on Summit (Power9 + Nvidia V100) and on 
local x86 workstations, and I can definitely provide comparative benchmark data 
with this feature once it is ready.  Just let me know when it is available for 
testing and I'll be happy to contribute.

Thanks,

-David

On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao 
<jczh...@mcs.anl.gov> wrote:
This feature is under active development. I hope I can make it usable in a 
couple of weeks. Thanks.
--Junchao Zhang


On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hello,

I'm currently using PETSc for the GPU acceleration of a simple Krylov solver 
(GMRES without preconditioning) within the framework of our in-house multigrid 
solver.  I am getting a good GPU speedup on the finest grid level but 
progressively worse performance on each coarser level.  This is not surprising, 
but I still hope to squeeze out some more performance, hopefully making it 
worthwhile to run some or all of the coarse grids on the GPU.
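For context, this setup corresponds roughly to running with PETSc options along 
the following lines (assuming the matrix and vector types are selected at 
runtime; -log_view is only there for profiling):

  -ksp_type gmres -pc_type none -mat_type aijcusparse -vec_type cuda -log_view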

I started investigating with nvprof / nsight and essentially came to the same 
conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD 
and DtoH) in Krylov solver").  My question is a follow-up to that thread:

The MPI communication is staged from the host, which results in some H<->D 
transfers for every mat-vec operation.   A CUDA-aware MPI implementation might 
avoid these transfers for communication between ranks that are assigned to the 
same accelerator.   Has this been implemented or tested?

In our solver we typically run with multiple MPI ranks all assigned to a single 
device, and running with a single rank is not really feasible as we still have 
a sizable amount of work for the CPU to chew through.  Thus, I think quite a 
lot of the H<->D transfers could be avoided if I can skip the MPI staging on 
the host. I am quite new to PETSc so I wanted to ask around before blindly 
digging into this.

Thanks for your help,

David
