GPU-related stuff

2009-07-09 Thread Farshid Mossaiby

Hi all,

Some time ago on this list, there was some discussion about GPUs and a GPU
version of PETSc. I would like to know if there has been any progress. Also, I
need some advice on preconditioners suitable for GPU platforms.

May I know what platform/language you are using, e.g. nVidia/CUDA, ATI/ATI 
Stream SDK or OpenCL?

Best regards,
Farshid Mossaiby

GPU-related stuff

2009-07-09 Thread Matthew Knepley
On Thu, Jul 9, 2009 at 6:15 AM, Farshid Mossaiby mossaiby at yahoo.com wrote:


 Hi all,

 Some time ago on this list, there was some discussion about GPUs and a GPU
 version of PETSc. I would like to know if there has been any progress. Also,
 I need some advice on preconditioners suitable for GPU platforms.


We have been progressing, but will not make a release until the fall. PCs
that have high flop-to-memory-access ratios look good.
No surprise there.
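
As a rough back-of-the-envelope illustration (ballpark 2009-era numbers,
just for orientation, not measurements):

  sparse mat-vec, double CSR:  ~2 flops per nonzero for ~12 bytes moved
                               (8-byte value + 4-byte column index)
  arithmetic intensity:        2 / 12 ~ 0.17 flop/byte
  Tesla C1060 bandwidth:       ~100 GB/s
  SpMV ceiling:                0.17 * 100 ~ 17 GFlop/s, far below peak

Anything built from SpMV-like kernels inherits that ceiling, which is why
only PCs doing many flops per byte fetched are worth the port.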



 May I know what platform/language you are using, e.g. nVidia/CUDA, ATI/ATI
 Stream SDK or OpenCL?


CUDA.

  Matt



 Best regards,
 Farshid Mossaiby

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener


GPU-related stuff

2009-07-09 Thread Jed Brown
Matthew Knepley wrote:

 PCs that have high flop-to-memory-access ratios look good.  No
 surprise there.

My concern here is that almost all good preconditioners are
multiplicative in the fine-grained kernels or do significant work on
coarse levels.  Both of these properties are very bad for a GPU.
Switching from SOR or ILU to Jacobi or red-black Gauss-Seidel will
greatly improve the throughput on a GPU, but the resulting smoother is
normally much less effective.  Since the GPU typically needs thousands
of threads to attain high performance, it's really hard to use on all
but the finest level.
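
To make the contrast concrete, here is a minimal CUDA sketch of one
weighted-Jacobi sweep on a 2D 5-point Laplacian (the grid layout, omega,
and h2 = h*h are illustrative choices, not anything from PETSc): every
point updates independently from the old iterate, so one thread per
point keeps thousands of threads busy, which is exactly the dependency
structure SOR and ILU lack.

__global__ void jacobi_sweep(const double *u, double *u_new,
                             const double *f, int nx, int ny,
                             double h2, double omega)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return; /* interior only */
  int id = j * nx + i;
  /* 5-point stencil for -Laplace(u) = f on a uniform grid, spacing h */
  double sigma = u[id - 1] + u[id + 1] + u[id - nx] + u[id + nx];
  double unew  = (h2 * f[id] + sigma) / 4.0;
  u_new[id] = (1.0 - omega) * u[id] + omega * unew; /* weighted update */
}

/* e.g. dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
   jacobi_sweep<<<grid, block>>>(d_u, d_unew, d_f, nx, ny, h * h, 0.8); */

Red-black Gauss-Seidel looks almost the same, just launched twice per
sweep (once per color) and updating in place.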

One of the more interesting preconditioners would be three-level balancing
or overlapping domain decomposition with very small subdomains (like
thousands of subdomains per process).  There would then be one subregion
per process and a global coarse level.  This would allow the PC to be
additive with chunks of the right block size, while keeping a minimal
amount of work on the coarser levels (which are handled by the CPU).
(It's really hard to get multigrid to coarsen this rapidly, as in 1M dofs
to 10 dofs in two levels.)  Unfortunately, this sort of scheme is rather
problem- and discretization-dependent, as well as rather complex to
implement.
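
The fine-level piece of such a scheme is at least easy to picture on the
GPU.  A hypothetical sketch, assuming non-overlapping subdomains of equal
size m with precomputed dense local inverses (the names and storage layout
are made up for illustration): one thread block applies one local solve,
and because the method is additive, every subdomain goes in a single
batched kernel launch.

__global__ void asm_local_solves(const double *Ainv, /* nsub dense m x m inverses */
                                 const double *r,    /* nsub local residuals */
                                 double *z,          /* nsub local corrections */
                                 int m)
{
  int s   = blockIdx.x;   /* one block per subdomain */
  int row = threadIdx.x;  /* one thread per local row, blockDim.x == m */
  const double *A_s = Ainv + (size_t)s * m * m;
  const double *r_s = r    + (size_t)s * m;
  double sum = 0.0;
  for (int k = 0; k < m; ++k)  /* small dense matvec: z_s = Ainv_s * r_s */
    sum += A_s[row * m + k] * r_s[k];
  z[(size_t)s * m + row] = sum;
}

/* e.g. asm_local_solves<<<nsub, m>>>(d_Ainv, d_r, d_z, m); with m ~ 100
   and nsub in the thousands, while the subregion and coarse solves stay
   on the CPU as described above. */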

I'll be interested to see what sort of performance you can get for real
preconditioners on a GPU.

Jed



GPU-related stuff

2009-07-09 Thread Matthew Knepley
On Thu, Jul 9, 2009 at 7:31 AM, Jed Brown jed at 59a2.org wrote:

 Matthew Knepley wrote:

  PCs that have high flop-to-memory-access ratios look good.  No
  surprise there.

 My concern here is that almost all good preconditioners are
 multiplicative in the fine-grained kernels or do significant work on
 coarse levels.  Both of these properties are very bad for a GPU.
 Switching from SOR or ILU to Jacobi or red-black Gauss-Seidel will
 greatly improve the throughput on a GPU, but the resulting smoother is
 normally much less effective.  Since the GPU typically needs thousands
 of threads to attain high performance, it's really hard to use on all
 but the finest level.


I agree with all these comments. I have no idea how to make those PCs
work. I am counting on Barry's genius here.



 One of the more interesting preconditioners would be three-level balancing
 or overlapping domain decomposition with very small subdomains (like
 thousands of subdomains per process).  There would then be one subregion
 per process and a global coarse level.  This would allow the PC to be
 additive with chunks of the right block size, while keeping a minimal
 amount of work on the coarser levels (which are handled by the CPU).
 (It's really hard to get multigrid to coarsen this rapidly, as in 1M dofs
 to 10 dofs in two levels.)  Unfortunately, this sort of scheme is rather
 problem- and discretization-dependent, as well as rather complex to
 implement.


With regard to targets, my strategy is to implement things that I can
prove work well on a GPU. For starters, we have the fast multipole method
(FMM). We have done a complete computational model and can prove that it
will scale almost indefinitely. The first paper is out, and the other two
are almost done. We are also implementing wavelets, since the structure
and proofs are very similar to FMM.

The strategy is to use FMM/wavelets on the problems they can solve directly
to precondition more complex problems. The prototype is Stokes
preconditioning variable-viscosity Stokes, which I am working on with
Dave May and Dave Yuen.
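
In PETSc terms the natural packaging for this is a shell preconditioner.
A minimal sketch, using today's PCSHELL interface rather than the 2009 one,
with a made-up fmm_apply_gpu() standing in for whatever the FMM library
actually exposes (none of this is the real implementation):

#include <petscksp.h>

/* hypothetical GPU FMM call; the signature is invented for illustration */
extern void fmm_apply_gpu(const PetscScalar *in, PetscScalar *out, PetscInt n);

static PetscErrorCode PCApply_FMM(PC pc, Vec x, Vec y)
{
  PetscErrorCode     ierr;
  const PetscScalar *xa;
  PetscScalar       *ya;
  PetscInt           n;

  ierr = VecGetLocalSize(x, &n);CHKERRQ(ierr);
  ierr = VecGetArrayRead(x, &xa);CHKERRQ(ierr);
  ierr = VecGetArray(y, &ya);CHKERRQ(ierr);
  fmm_apply_gpu(xa, ya, n);            /* y ~ (approximate inverse) * x */
  ierr = VecRestoreArray(y, &ya);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(x, &xa);CHKERRQ(ierr);
  return 0;
}

/* hooking it into a solve:
     KSPGetPC(ksp, &pc);
     PCSetType(pc, PCSHELL);
     PCShellSetApply(pc, PCApply_FMM);  */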


 I'll be interested to see what sort of performance you can get for real
 preconditioners on a GPU.


Felipe Cruz has preliminary numbers for FMM: 500 GFlop/s on a single
Tesla C1060! That is probably ten times what you can hope to achieve
with traditional relaxation (I think).

   Matt



 Jed

-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener