PETSc on GPUs

2008-09-18 Thread Barry Smith

In implementing Vec in the GPU a VecScatter would take the entries  
directly from GPU copy of the
vector and put them appropriately into the GPU copy of the other  
vector, thus it would be completely
transparent from the MatMult_MPIAIJ() code. In other words, if you  
have to modify the MatMult_MPIAIJ
IN ANYWAY for the GPU then there is something wrong.

Barry

Note the VecScatter would need to marshal the values up from the GPU  
into the regular memory,
do the appropriate message passing to the resulting processes and then  
marshal the values back
down to the destination GPU. Since the current VecScatter handles all  
the message passing I think
the GPU VecScatter() could be written to 1) move the needed GPU values  
up to regular memory
(inside work MPI Vecs) then 2) use a regular VecScatter on those work  
vectors then 3) move the results
down from the regular memory to the GPU.

In other words PETSc has been designed since 1991 to be able to reuse  
almost all the infrastructure
and only need new low level kernels written to run on GPUs. :-)

Fun stuff, sure beats working :-)



On Sep 18, 2008, at 3:18 PM, Richard Tran Mills wrote:

> Ahmed,
>
> If you take a look in
>
>  src/mat/impls/aij/mpi/
>
> at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is  
> stored locally as two "sequential" matrics, A and B.  A contains the  
> "diagonal" portion of the matrix, and B contains the "off-diagonal"  
> portion.  See page 59 of the PETSc User manual for a quick  
> illustration of how this partitioning works. Because vectors are  
> partitioned row-wise among processors in the same manner as the  
> matrix, each process already has the vector entries required to form  
> the portion of the matrix-vector product ("mat-vec") involving the  
> diagonal portion 'A' can be done with data that reside locally.  The  
> off-diagonal portion 'B', however, requires gathering vector entries  
> that reside off-processor.  If you look in mpiaij.c at the  
> MatMult_MPIAIJ, you will see the following lines:
>
>  ierr = VecScatterBegin(a->Mvctx,xx,a- 
> >lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>  ierr = VecScatterEnd(a->Mvctx,xx,a- 
> >lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>
> The mat-vec involving a->A can be completed without any VecScatter,  
> but the one involving a->B needs the VecScatter to complete before  
> it can be done. (Note that the VecScatterBegin() occurs before the  
> mat-vec for a->A so that there might be some overlap of  
> communication and computation).
>
> Hopefully this helps elucidate what Matt meant, and didn't just  
> confuse you.
>
> --Richard
>
> Ahmed El Zein wrote:
>> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
 A question that I had regarding the PETSc code when I was thinking
>>> about
 this was:
 You have the SeqAIJ matrix type and the the MPIAIJ type built  
 around
>>> it
 (or that is what I understand from the code). So basically you
>>> implement
 the SeqAIJ type for the GPU and you get the MPI type for free?
>>> Yes, that is true. However, note that in the MPI step, you will  
>>> need a
>>> gather
>>> operation to get the second matrix multiply to work.
>> Matt,
>> Could you explain a bit more?
>> Ahmed
>
> -- 
> Richard Tran Mills, Ph.D.|   E-mail: rmills at climate.ornl.gov
> Computational Scientist  |   Phone:  (865) 241-3198
> Computational Earth Sciences Group   |   Fax:(865) 574-0405
> Oak Ridge National Laboratory|   http://climate.ornl.gov/~rmills
>




PETSc on GPUs

2008-09-18 Thread Richard Tran Mills
Ahmed,

If you take a look in

   src/mat/impls/aij/mpi/

at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is stored locally 
as two "sequential" matrics, A and B.  A contains the "diagonal" portion of 
the matrix, and B contains the "off-diagonal" portion.  See page 59 of the 
PETSc User manual for a quick illustration of how this partitioning works. 
Because vectors are partitioned row-wise among processors in the same manner 
as the matrix, each process already has the vector entries required to form 
the portion of the matrix-vector product ("mat-vec") involving the diagonal 
portion 'A' can be done with data that reside locally.  The off-diagonal 
portion 'B', however, requires gathering vector entries that reside 
off-processor.  If you look in mpiaij.c at the MatMult_MPIAIJ, you will see 
the following lines:

   ierr = 
VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
   ierr = 
VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);

The mat-vec involving a->A can be completed without any VecScatter, but the 
one involving a->B needs the VecScatter to complete before it can be done. 
(Note that the VecScatterBegin() occurs before the mat-vec for a->A so that 
there might be some overlap of communication and computation).

Hopefully this helps elucidate what Matt meant, and didn't just confuse you.

--Richard

Ahmed El Zein wrote:
> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
>>> A question that I had regarding the PETSc code when I was thinking
>> about
>>> this was:
>>> You have the SeqAIJ matrix type and the the MPIAIJ type built around
>> it
>>> (or that is what I understand from the code). So basically you
>> implement
>>> the SeqAIJ type for the GPU and you get the MPI type for free?
>> Yes, that is true. However, note that in the MPI step, you will need a
>> gather
>> operation to get the second matrix multiply to work.
> Matt,
> Could you explain a bit more?
> 
> Ahmed

-- 
Richard Tran Mills, Ph.D.|   E-mail: rmills at climate.ornl.gov
Computational Scientist  |   Phone:  (865) 241-3198
Computational Earth Sciences Group   |   Fax:(865) 574-0405
Oak Ridge National Laboratory|   http://climate.ornl.gov/~rmills




PETSc on GPUs

2008-09-18 Thread Matthew Knepley
On Thu, Sep 18, 2008 at 1:34 PM, Ahmed El Zein  wrote:
> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote:
>> > A question that I had regarding the PETSc code when I was thinking
>> about
>> > this was:
>> > You have the SeqAIJ matrix type and the the MPIAIJ type built around
>> it
>> > (or that is what I understand from the code). So basically you
>> implement
>> > the SeqAIJ type for the GPU and you get the MPI type for free?
>>
>> Yes, that is true. However, note that in the MPI step, you will need a
>> gather
>> operation to get the second matrix multiply to work.
> Matt,
> Could you explain a bit more?

The diagonal block performs a simple, serial AIJ MatMult(). However, for the
offdiagonal block, you must gather the Vec entries for nonzero columns in
that block, fit them into a small Vec, do another serial AIJ MatMult(), and then
scatter them back to the correct locations.

  Matt

> Ahmed
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener




PETSc on GPUs

2008-09-18 Thread Ahmed El Zein
On Wed, 2008-09-17 at 21:33 -0500, Barry Smith wrote:
> Ahmed,
> 
> This is very cool.
> On Sep 17, 2008, at 8:05 PM, Ahmed El Zein wrote:
> 
> > Harald,
> > I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML
> > applications (just about finished actually).
> >
> > As the SML application I am planing to base my work on uses PETSc, I
> > have written some functions that will convert AIJ matrices and Vectors
> > to SP and copy them to the GPU, multiply them and copy them back to  
> > the
> > host. I would be happy to share them with you if you want.
> 
> I think to really make the GPU truly a large step forward in  
> performance, the Vec's need to
> be kept on the GPU and only transported back to the main CPU when
> absolutely needed. For example, consider a KSP solver like CG, it would
> run on the main CPU but each Vec is actually just a handle for the  
> true vector
> entries that are in the GPU memory, a call to VecAXPY(), for example,  
> would
> pass the scalar and the two Vec handles down to the GPU where the actual
> axpy is performed. With this paradigm the only values passed back to the
> main CPU are scalars. This is why I think this work has to be done only
> on the latests GPU systems with lots of memory.
> 
You are right! That is what I do for the SML application. I convert and
copy the matrix to the GPU and then iteratively send a new vector to the
GPU for multiplication. In fact unless the matrix will be reused at
least 4 times, there would be no performance gain!

The 8800 GTX has 768 MB of memory but you can have multiple GPUs running
on your machine and split your data amongst them to effectively get more
memory.

What I was thinking of was to add GPU pointers to a PETSC Mat or Vec
object with an optional shadow parameter. If shadow is enabled the host
will keep a copy of what is on the GPU in main memory. That way if there
are changes to the original matrix:
1. It might be less expensive to make the modification on the host.
2. It might be possible to update the matrix on the GPU by sending
diffs, as the copying of data back and forth is the most expensive
operation.

It also will allow the use of that matrix on both the host and the GPU
to offer maximum flexibility. Matrix assembly would be done on the host
anyway.

(I am not sure if the above points are loads of rubbish or not. But I
think that there are many options to be considered in a PETSc GPU
implementation.)


A question that I had regarding the PETSc code when I was thinking about
this was:
You have the SeqAIJ matrix type and the the MPIAIJ type built around it
(or that is what I understand from the code). So basically you implement
the SeqAIJ type for the GPU and you get the MPI type for free?

Ahmed

> Barry 
> 
> >
> >
> > While this would be outside the scope of my MPhil I would be very
> > interested in helping to add GPU support for PETSC. I have not yet had
> > any experience with CTM for programming ATI GPUs but I believe there
> > would not be a huge difference.
> >
> > I have access to a GeForce 8800GTX GPU (single precession only) at the
> > ANU. I have been talking with my supervisor about getting a GTX280 or
> > GTX 260 (supports double precision) but I don't know if we will be
> > getting one.
> >
> > Anyway I would like to help. So if anyone would like to start thinking
> > how this would be best implemented, I am available. :)
> >
> > Ahmed
> >
> > On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote:
> >> Hi,
> >>
> >> do you know whether there are any efforts to run PETSc on GPUs
> >> (graphical processing units)?
> >>
> >> Thanks,
> >> Harald
> >>
> >>
> >>
> >




PETSc on GPUs

2008-09-18 Thread Ahmed El Zein
Harald,
I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML
applications (just about finished actually).

As the SML application I am planing to base my work on uses PETSc, I
have written some functions that will convert AIJ matrices and Vectors
to SP and copy them to the GPU, multiply them and copy them back to the
host. I would be happy to share them with you if you want.

While this would be outside the scope of my MPhil I would be very
interested in helping to add GPU support for PETSC. I have not yet had
any experience with CTM for programming ATI GPUs but I believe there
would not be a huge difference.

I have access to a GeForce 8800GTX GPU (single precession only) at the
ANU. I have been talking with my supervisor about getting a GTX280 or
GTX 260 (supports double precision) but I don't know if we will be
getting one.
 
Anyway I would like to help. So if anyone would like to start thinking
how this would be best implemented, I am available. :)

Ahmed

On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote:
> Hi,
> 
> do you know whether there are any efforts to run PETSc on GPUs 
> (graphical processing units)?
> 
> Thanks,
> Harald
> 
> 
> 




PETSc on GPUs

2008-09-18 Thread Matthew Knepley
On Wed, Sep 17, 2008 at 10:51 PM, Ahmed El Zein  wrote:
> On Wed, 2008-09-17 at 21:33 -0500, Barry Smith wrote:
>> Ahmed,
>>
>> This is very cool.
>> On Sep 17, 2008, at 8:05 PM, Ahmed El Zein wrote:
>>
>> > Harald,
>> > I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML
>> > applications (just about finished actually).
>> >
>> > As the SML application I am planing to base my work on uses PETSc, I
>> > have written some functions that will convert AIJ matrices and Vectors
>> > to SP and copy them to the GPU, multiply them and copy them back to
>> > the
>> > host. I would be happy to share them with you if you want.
>>
>> I think to really make the GPU truly a large step forward in
>> performance, the Vec's need to
>> be kept on the GPU and only transported back to the main CPU when
>> absolutely needed. For example, consider a KSP solver like CG, it would
>> run on the main CPU but each Vec is actually just a handle for the
>> true vector
>> entries that are in the GPU memory, a call to VecAXPY(), for example,
>> would
>> pass the scalar and the two Vec handles down to the GPU where the actual
>> axpy is performed. With this paradigm the only values passed back to the
>> main CPU are scalars. This is why I think this work has to be done only
>> on the latests GPU systems with lots of memory.
>>
> You are right! That is what I do for the SML application. I convert and
> copy the matrix to the GPU and then iteratively send a new vector to the
> GPU for multiplication. In fact unless the matrix will be reused at
> least 4 times, there would be no performance gain!
>
> The 8800 GTX has 768 MB of memory but you can have multiple GPUs running
> on your machine and split your data amongst them to effectively get more
> memory.
>
> What I was thinking of was to add GPU pointers to a PETSC Mat or Vec
> object with an optional shadow parameter. If shadow is enabled the host
> will keep a copy of what is on the GPU in main memory. That way if there
> are changes to the original matrix:
> 1. It might be less expensive to make the modification on the host.
> 2. It might be possible to update the matrix on the GPU by sending
> diffs, as the copying of data back and forth is the most expensive
> operation.
>
> It also will allow the use of that matrix on both the host and the GPU
> to offer maximum flexibility. Matrix assembly would be done on the host
> anyway.
>
> (I am not sure if the above points are loads of rubbish or not. But I
> think that there are many options to be considered in a PETSc GPU
> implementation.)
>
>
> A question that I had regarding the PETSc code when I was thinking about
> this was:
> You have the SeqAIJ matrix type and the the MPIAIJ type built around it
> (or that is what I understand from the code). So basically you implement
> the SeqAIJ type for the GPU and you get the MPI type for free?

Yes, that is true. However, note that in the MPI step, you will need a gather
operation to get the second matrix multiply to work.

  Matt

> Ahmed
>
>> Barry
>>
>> >
>> >
>> > While this would be outside the scope of my MPhil I would be very
>> > interested in helping to add GPU support for PETSC. I have not yet had
>> > any experience with CTM for programming ATI GPUs but I believe there
>> > would not be a huge difference.
>> >
>> > I have access to a GeForce 8800GTX GPU (single precession only) at the
>> > ANU. I have been talking with my supervisor about getting a GTX280 or
>> > GTX 260 (supports double precision) but I don't know if we will be
>> > getting one.
>> >
>> > Anyway I would like to help. So if anyone would like to start thinking
>> > how this would be best implemented, I am available. :)
>> >
>> > Ahmed
>> >
>> > On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote:
>> >> Hi,
>> >>
>> >> do you know whether there are any efforts to run PETSc on GPUs
>> >> (graphical processing units)?
>> >>
>> >> Thanks,
>> >> Harald
>> >>
>> >>
>> >>
>> >
>
-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener