I definitely will. Thanks.
--Junchao Zhang

On Thu, Aug 22, 2019 at 11:34 AM David Gutzwiller
<david.gutzwil...@gmail.com> wrote:
Hello Junchao,

Spectacular news!

I have our production code running on Summit (Power9 + Nvidia V100) and on 
local x86 workstations, and I can definitely provide comparative benchmark data 
with this feature once it is ready.  Just let me know when it is available for 
testing and I'll be happy to contribute.

Thanks,

-David

On Thu, Aug 22, 2019 at 7:22 AM Zhang, Junchao
<jczh...@mcs.anl.gov> wrote:
This feature is under active development. I hope I can make it usable in a 
couple of weeks. Thanks.
--Junchao Zhang


On Wed, Aug 21, 2019 at 3:21 PM David Gutzwiller via petsc-users
<petsc-users@mcs.anl.gov> wrote:
Hello,

I'm currently using PETSc for GPU acceleration of a simple Krylov solver
(GMRES, without preconditioning). This is within the framework of our in-house
multigrid solver. I am getting a good GPU speedup on the finest grid level but
progressively worse performance on each coarser level. This is not surprising,
but I still hope to squeeze out some more performance, ideally making it
worthwhile to run some or all of the coarse grids on the GPU.

I started investigating with nvprof / nsight and essentially came to the same 
conclusion that Xiangdong reported in a recent thread (July 16, "MemCpy (HtoD 
and DtoH) in Krylov solver").  My question is a follow-up to that thread:

The MPI communication is staged from the host, which results in some H<->D 
transfers for every mat-vec operation.   A CUDA-aware MPI implementation might 
avoid these transfers for communication between ranks that are assigned to the 
same accelerator.   Has this been implemented or tested?

In our solver we typically run with multiple MPI ranks all assigned to a single 
device, and running with a single rank is not really feasible as we still have 
a sizable amount of work for the CPU to chew through.  Thus, I think quite a 
lot of the H<->D transfers could be avoided if I can skip the MPI staging on 
the host. I am quite new to PETSc so I wanted to ask around before blindly 
digging into this.
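
To make the concern concrete, here is a toy bookkeeping model of the copies
described above. It makes no real MPI or CUDA calls. The names `TransferLog`,
`halo_exchange`, and `copies_per_solve` are hypothetical, and the model assumes
one send buffer and one receive buffer per rank per mat-vec, which is a
simplification and not a description of what PETSc actually does internally.

```python
# Toy model only: no real MPI or CUDA calls are made.
from dataclasses import dataclass


@dataclass
class TransferLog:
    """Counts explicit host<->device staging copies (hypothetical bookkeeping)."""
    dtoh: int = 0
    htod: int = 0


def halo_exchange(n_ranks: int, cuda_aware: bool, log: TransferLog) -> None:
    """Model the communication of one mat-vec for n_ranks ranks sharing a GPU.

    Assumption: each rank owns one send buffer and one receive buffer,
    both resident in device memory.
    """
    for _ in range(n_ranks):
        if cuda_aware:
            # Device pointers are handed to MPI directly, so the caller
            # performs no explicit staging copy.
            continue
        log.dtoh += 1  # send buffer copied device -> host before the send
        log.htod += 1  # receive buffer copied host -> device after the receive


def copies_per_solve(n_ranks: int, n_matvecs: int, cuda_aware: bool) -> TransferLog:
    """Accumulate staging copies over all mat-vecs of one solve."""
    log = TransferLog()
    for _ in range(n_matvecs):
        halo_exchange(n_ranks, cuda_aware, log)
    return log


if __name__ == "__main__":
    # Example values (4 ranks per device, 30 mat-vecs) are illustrative only.
    print(copies_per_solve(n_ranks=4, n_matvecs=30, cuda_aware=False))
    print(copies_per_solve(n_ranks=4, n_matvecs=30, cuda_aware=True))
```

In this model the number of staging copies grows with both the rank count per
device and the number of mat-vecs. A CUDA-aware MPI library may still stage
data internally, so the zero count in the second case refers only to explicit
copies made by the caller.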

Thanks for your help,

David
