GPU related stuff

2009-07-10 Thread Farshid Mossaiby

Thanks, all, for the comments.

--- On Thu, 7/9/09, Matthew Knepley wrote:

> From: Matthew Knepley
> Subject: Re: GPU related stuff
> To: "For users of the development version of PETSc"
> Date: Thursday, July 9, 2009, 5:09 PM
>
> On Thu, Jul 9, 2009 at 7:31 AM, Jed Brown wrote:
>
> > Matthew Knepley wrote:
> >
> > > PCs which have high flop to memory access ratios look good.  No
> > > surprise there.
> >
> > My concern here is that almost all "good" preconditioners are
> > multiplicative in the fine-grained kernels or do significant work on
> > coarse levels.  Both of these are very bad for putting on a GPU.
> > Switching from SOR or ILU to Jacobi or red-black GS will greatly improve
> > the throughput on a GPU, but is normally much less effective.  Since the
> > GPU typically needs thousands of threads to attain high performance,
> > it's really hard to use on all but the finest level.
>
> I agree with all these comments. I have no idea how to make those PCs
> work. I am counting on Barry's genius here.
>
> > One of the more interesting preconditioners would be 3-level balancing
> > or overlapping DD with very small subdomains (like thousands of
> > subdomains per process).  There would then be 1 subregion per process
> > and a global coarse level.  This would allow the PC to be additive with
> > chunks of the right block size, while keeping a minimal amount of work
> > on the coarser levels (which are handled by the CPU).  (It's really hard
> > to get multigrid to coarsen this rapidly, as in 1M dofs to 10 dofs in 2
> > levels.)  Unfortunately, this sort of scheme is rather problem- and
> > discretization-dependent, as well as rather complex to implement.
>
> With regard to targets, my strategy is to implement things that I can
> prove work well on a GPU. For starters, we have FMM. We have done
> a complete computational model and can prove that this will scale almost
> indefinitely. The first paper is out, and the other 2 are almost done.
> We are also implementing wavelets, since the structure and proofs are
> very similar to FMM.
>
> The strategy is to use FMM/Wavelets for problems they can solve to
> precondition more complex problems. The prototype is Stokes
> preconditioning variable viscosity Stokes, which I am working on with
> Dave May and Dave Yuen.
>
> > I'll be interested to see what sort of performance you can get for real
> > preconditioners on a GPU.
>
> Felipe Cruz has preliminary numbers for FMM: 500 GF on a single 1060C!
> That is probably 10 times what you can hope to achieve with traditional
> relaxation (I think).
>
>    Matt
>
> > Jed
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which
> their experiments lead.
> -- Norbert Wiener


  



GPU related stuff

2009-07-09 Thread Jed Brown
Matthew Knepley wrote:

> PCs which have high flop to memory access ratios look good.  No
> surprise there.

My concern here is that almost all "good" preconditioners are
multiplicative in the fine-grained kernels or do significant work on
coarse levels.  Both of these are very bad for putting on a GPU.
Switching from SOR or ILU to Jacobi or red-black GS will greatly improve
the throughput on a GPU, but is normally much less effective.  Since the
GPU typically needs thousands of threads to attain high performance,
it's really hard to use on all but the finest level.
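
A minimal sketch of the trade-off described above, assuming a toy 1D Poisson system tridiag(-1, 2, -1) u = f with zero boundary values (the model problem, the tolerance, and all function names are illustrative assumptions, not anything from PETSc). It counts sweeps to convergence for Jacobi, lexicographic Gauss-Seidel, and red-black Gauss-Seidel, and the comments mark which loops have independent iterations.

```python
def neighbors(u, i):
    """Left and right neighbor values, with zero Dirichlet boundaries."""
    left = u[i - 1] if i > 0 else 0.0
    right = u[i + 1] if i < len(u) - 1 else 0.0
    return left, right


def residual_norm(u, f):
    """Max-norm of f - A u for A = tridiag(-1, 2, -1)."""
    worst = 0.0
    for i in range(len(u)):
        left, right = neighbors(u, i)
        worst = max(worst, abs(f[i] - (2.0 * u[i] - left - right)))
    return worst


def jacobi_sweep(u, f):
    old = u[:]  # every update reads only old values: all n updates are independent
    for i in range(len(u)):
        left, right = neighbors(old, i)
        u[i] = 0.5 * (f[i] + left + right)


def gauss_seidel_sweep(u, f):
    for i in range(len(u)):  # update i reads the fresh u[i-1]: a sequential chain
        left, right = neighbors(u, i)
        u[i] = 0.5 * (f[i] + left + right)


def red_black_sweep(u, f):
    for color in (0, 1):
        # points of one color read only points of the other color,
        # so each half-sweep is independent across its points
        for i in range(color, len(u), 2):
            left, right = neighbors(u, i)
            u[i] = 0.5 * (f[i] + left + right)


def sweeps_to_converge(sweep, n=15, tol=1e-8, max_sweeps=100000):
    f = [1.0] * n
    u = [0.0] * n
    for k in range(1, max_sweeps + 1):
        sweep(u, f)
        if residual_norm(u, f) < tol:
            return k
    return max_sweeps


counts = {
    "jacobi": sweeps_to_converge(jacobi_sweep),
    "gauss_seidel": sweeps_to_converge(gauss_seidel_sweep),
    "red_black": sweeps_to_converge(red_black_sweep),
}
print(counts)
```

On this model problem Jacobi needs roughly twice as many sweeps as either Gauss-Seidel variant. The toy case only shows the dependency structure and the Jacobi penalty; it says nothing about ILU or about harder, less regular problems.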

One of the more interesting preconditioners would be 3-level balancing
or overlapping DD with very small subdomains (like thousands of
subdomains per process).  There would then be 1 subregion per process
and a global coarse level.  This would allow the PC to be additive with
chunks of the right block size, while keeping a minimal amount of work
on the coarser levels (which are handled by the CPU).  (It's really hard
to get multigrid to coarsen this rapidly, as in 1M dofs to 10 dofs in 2
levels.)  Unfortunately, this sort of scheme is rather problem- and
discretization-dependent, as well as rather complex to implement.
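
The additive part of such a scheme can be sketched as follows, assuming the same toy 1D Poisson operator tridiag(-1, 2, -1), non-overlapping blocks, and no coarse level at all (so this is plain block Jacobi, not the 3-level method described above). The function names are hypothetical. The point is only that each small block solve is independent of the others, which is the property that suits a GPU.

```python
def thomas_solve_poisson(rhs):
    """Solve tridiag(-1, 2, -1) x = rhs with the Thomas algorithm."""
    m = len(rhs)
    c = [0.0] * m  # modified super-diagonal
    d = [0.0] * m  # modified right-hand side
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]  # b_i - a_i * c'_{i-1} with a_i = -1, b_i = 2
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def block_jacobi_apply(r, block_size):
    """z = sum_i R_i^T A_i^{-1} R_i r over non-overlapping blocks."""
    z = [0.0] * len(r)
    for start in range(0, len(r), block_size):
        # each block solve touches only its own slice of r and z,
        # so all blocks could run concurrently
        stop = min(start + block_size, len(r))
        z[start:stop] = thomas_solve_poisson(r[start:stop])
    return z


def poisson_matvec(u):
    """y = A u for A = tridiag(-1, 2, -1) with zero boundaries."""
    n = len(u)
    y = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        y.append(2.0 * u[i] - left - right)
    return y


r = [float(i % 3) for i in range(12)]
z_exact = block_jacobi_apply(r, block_size=12)  # one block: exact solve
z_small = block_jacobi_apply(r, block_size=4)   # three independent solves
z_point = block_jacobi_apply(r, block_size=1)   # degenerates to point Jacobi
```

With a single block the apply is an exact solve, and with blocks of size 1 it reduces to point Jacobi. Overlap and the coarse levels, which carry the global coupling, are deliberately left out of this sketch.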

I'll be interested to see what sort of performance you can get for real
preconditioners on a GPU.

Jed




GPU related stuff

2009-07-09 Thread Matthew Knepley
On Thu, Jul 9, 2009 at 7:31 AM, Jed Brown wrote:

> Matthew Knepley wrote:
>
> > PCs which have high flop to memory access ratios look good.  No
> > surprise there.
>
> My concern here is that almost all "good" preconditioners are
> multiplicative in the fine-grained kernels or do significant work on
> coarse levels.  Both of these are very bad for putting on a GPU.
> Switching from SOR or ILU to Jacobi or red-black GS will greatly improve
> the throughput on a GPU, but is normally much less effective.  Since the
> GPU typically needs thousands of threads to attain high performance,
> it's really hard to use on all but the finest level.


I agree with all these comments. I have no idea how to make those PCs
work. I am counting on Barry's genius here.


>
> One of the more interesting preconditioners would be 3-level balancing
> or overlapping DD with very small subdomains (like thousands of
> subdomains per process).  There would then be 1 subregion per process
> and a global coarse level.  This would allow the PC to be additive with
> chunks of the right block size, while keeping a minimal amount of work
> on the coarser levels (which are handled by the CPU).  (It's really hard
> to get multigrid to coarsen this rapidly, as in 1M dofs to 10 dofs in 2
> levels.)  Unfortunately, this sort of scheme is rather problem- and
> discretization-dependent, as well as rather complex to implement.


With regard to targets, my strategy is to implement things that I can
prove work well on a GPU. For starters, we have FMM. We have done
a complete computational model and can prove that this will scale almost
indefinitely. The first paper is out, and the other 2 are almost done.
We are also implementing wavelets, since the structure and proofs are
very similar to FMM.

The strategy is to use FMM/Wavelets for problems they can solve to
precondition more complex problems. The prototype is Stokes
preconditioning variable viscosity Stokes, which I am working on with
Dave May and Dave Yuen.


> I'll be interested to see what sort of performance you can get for real
> preconditioners on a GPU.


Felipe Cruz has preliminary numbers for FMM: 500 GF on a single 1060C!
That is probably 10 times what you can hope to achieve with traditional
relaxation (I think).

   Matt


>
> Jed
>
-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener



GPU related stuff

2009-07-09 Thread Matthew Knepley
On Thu, Jul 9, 2009 at 6:15 AM, Farshid Mossaiby wrote:

>
> Hi all,
>
> Some time ago on this list, there was some discussion about GPU and a GPU
> version of PETSc. I would like to know if there has been any progress. Also,
> I need some advice on preconditioners suitable for GPU platforms.


We have been progressing, but will not make a release until the fall.
PCs which have high flop to memory access ratios look good. No surprise
there.
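
To illustrate why the flop-to-memory-access ratio matters, here is a rough roofline-style estimate. The device numbers are hypothetical and chosen only for round arithmetic, and the 12 bytes per nonzero for a CSR sparse matrix-vector product is a common back-of-the-envelope figure that ignores vector traffic.

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    """Roofline bound: capped either by compute or by memory traffic."""
    return min(peak_gflops, flops_per_byte * bandwidth_gb_per_s)


# HYPOTHETICAL device numbers, not measurements of any real GPU.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_PER_S = 100.0

# CSR sparse mat-vec in double precision: 2 flops per nonzero against
# about 8 bytes (value) + 4 bytes (column index) read per nonzero.
spmv_intensity = 2.0 / 12.0

# A hypothetical compute-heavy kernel doing 20 flops per byte moved.
dense_intensity = 20.0

spmv_bound = attainable_gflops(spmv_intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
dense_bound = attainable_gflops(dense_intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
print(f"SpMV-like kernel bound:  {spmv_bound:.1f} GF/s")
print(f"compute-heavy bound:     {dense_bound:.1f} GF/s")
```

Under these assumed numbers the low-intensity kernel is limited by bandwidth to a small fraction of peak, while the high-intensity kernel can reach the compute limit.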


>
> May I know what platform/language you are using, e.g. nVidia/CUDA, ATI/ATI
> Stream SDK or OpenCL?


CUDA.

  Matt


>
> Best regards,
> Farshid Mossaiby
>
-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener



GPU related stuff

2009-07-09 Thread Farshid Mossaiby

Hi all,

Some time ago on this list, there was some discussion about GPU and a GPU 
version of PETSc. I would like to know if there has been any progress. Also, I 
need some advice on preconditioners suitable for GPU platforms.

May I know what platform/language you are using, e.g. nVidia/CUDA, ATI/ATI 
Stream SDK or OpenCL?

Best regards,
Farshid Mossaiby