PETSc on GPUs
In implementing Vec in the GPU a VecScatter would take the entries directly from GPU copy of the vector and put them appropriately into the GPU copy of the other vector, thus it would be completely transparent from the MatMult_MPIAIJ() code. In other words, if you have to modify the MatMult_MPIAIJ IN ANYWAY for the GPU then there is something wrong. Barry Note the VecScatter would need to marshal the values up from the GPU into the regular memory, do the appropriate message passing to the resulting processes and then marshal the values back down to the destination GPU. Since the current VecScatter handles all the message passing I think the GPU VecScatter() could be written to 1) move the needed GPU values up to regular memory (inside work MPI Vecs) then 2) use a regular VecScatter on those work vectors then 3) move the results down from the regular memory to the GPU. In other words PETSc has been designed since 1991 to be able to reuse almost all the infrastructure and only need new low level kernels written to run on GPUs. :-) Fun stuff, sure beats working :-) On Sep 18, 2008, at 3:18 PM, Richard Tran Mills wrote: > Ahmed, > > If you take a look in > > src/mat/impls/aij/mpi/ > > at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is > stored locally as two "sequential" matrics, A and B. A contains the > "diagonal" portion of the matrix, and B contains the "off-diagonal" > portion. See page 59 of the PETSc User manual for a quick > illustration of how this partitioning works. Because vectors are > partitioned row-wise among processors in the same manner as the > matrix, each process already has the vector entries required to form > the portion of the matrix-vector product ("mat-vec") involving the > diagonal portion 'A' can be done with data that reside locally. The > off-diagonal portion 'B', however, requires gathering vector entries > that reside off-processor. If you look in mpiaij.c at the > MatMult_MPIAIJ, you will see the following lines: > > ierr = VecScatterBegin(a->Mvctx,xx,a- > >lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr); > ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr); > ierr = VecScatterEnd(a->Mvctx,xx,a- > >lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr); > ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr); > > The mat-vec involving a->A can be completed without any VecScatter, > but the one involving a->B needs the VecScatter to complete before > it can be done. (Note that the VecScatterBegin() occurs before the > mat-vec for a->A so that there might be some overlap of > communication and computation). > > Hopefully this helps elucidate what Matt meant, and didn't just > confuse you. > > --Richard > > Ahmed El Zein wrote: >> On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote: A question that I had regarding the PETSc code when I was thinking >>> about this was: You have the SeqAIJ matrix type and the the MPIAIJ type built around >>> it (or that is what I understand from the code). So basically you >>> implement the SeqAIJ type for the GPU and you get the MPI type for free? >>> Yes, that is true. However, note that in the MPI step, you will >>> need a >>> gather >>> operation to get the second matrix multiply to work. >> Matt, >> Could you explain a bit more? >> Ahmed > > -- > Richard Tran Mills, Ph.D.| E-mail: rmills at climate.ornl.gov > Computational Scientist | Phone: (865) 241-3198 > Computational Earth Sciences Group | Fax:(865) 574-0405 > Oak Ridge National Laboratory| http://climate.ornl.gov/~rmills >
PETSc on GPUs
Ahmed, If you take a look in src/mat/impls/aij/mpi/ at mpiaij.h and mpiaij.c, you can see that an MPIAIJ matrix is stored locally as two "sequential" matrics, A and B. A contains the "diagonal" portion of the matrix, and B contains the "off-diagonal" portion. See page 59 of the PETSc User manual for a quick illustration of how this partitioning works. Because vectors are partitioned row-wise among processors in the same manner as the matrix, each process already has the vector entries required to form the portion of the matrix-vector product ("mat-vec") involving the diagonal portion 'A' can be done with data that reside locally. The off-diagonal portion 'B', however, requires gathering vector entries that reside off-processor. If you look in mpiaij.c at the MatMult_MPIAIJ, you will see the following lines: ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr); ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr); ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr); ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr); The mat-vec involving a->A can be completed without any VecScatter, but the one involving a->B needs the VecScatter to complete before it can be done. (Note that the VecScatterBegin() occurs before the mat-vec for a->A so that there might be some overlap of communication and computation). Hopefully this helps elucidate what Matt meant, and didn't just confuse you. --Richard Ahmed El Zein wrote: > On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote: >>> A question that I had regarding the PETSc code when I was thinking >> about >>> this was: >>> You have the SeqAIJ matrix type and the the MPIAIJ type built around >> it >>> (or that is what I understand from the code). So basically you >> implement >>> the SeqAIJ type for the GPU and you get the MPI type for free? >> Yes, that is true. However, note that in the MPI step, you will need a >> gather >> operation to get the second matrix multiply to work. > Matt, > Could you explain a bit more? > > Ahmed -- Richard Tran Mills, Ph.D.| E-mail: rmills at climate.ornl.gov Computational Scientist | Phone: (865) 241-3198 Computational Earth Sciences Group | Fax:(865) 574-0405 Oak Ridge National Laboratory| http://climate.ornl.gov/~rmills
PETSc on GPUs
On Thu, Sep 18, 2008 at 1:34 PM, Ahmed El Zein wrote: > On Thu, 2008-09-18 at 10:20 -0500, Matthew Knepley wrote: >> > A question that I had regarding the PETSc code when I was thinking >> about >> > this was: >> > You have the SeqAIJ matrix type and the the MPIAIJ type built around >> it >> > (or that is what I understand from the code). So basically you >> implement >> > the SeqAIJ type for the GPU and you get the MPI type for free? >> >> Yes, that is true. However, note that in the MPI step, you will need a >> gather >> operation to get the second matrix multiply to work. > Matt, > Could you explain a bit more? The diagonal block performs a simple, serial AIJ MatMult(). However, for the offdiagonal block, you must gather the Vec entries for nonzero columns in that block, fit them into a small Vec, do another serial AIJ MatMult(), and then scatter them back to the correct locations. Matt > Ahmed -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
PETSc on GPUs
On Wed, 2008-09-17 at 21:33 -0500, Barry Smith wrote: > Ahmed, > > This is very cool. > On Sep 17, 2008, at 8:05 PM, Ahmed El Zein wrote: > > > Harald, > > I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML > > applications (just about finished actually). > > > > As the SML application I am planing to base my work on uses PETSc, I > > have written some functions that will convert AIJ matrices and Vectors > > to SP and copy them to the GPU, multiply them and copy them back to > > the > > host. I would be happy to share them with you if you want. > > I think to really make the GPU truly a large step forward in > performance, the Vec's need to > be kept on the GPU and only transported back to the main CPU when > absolutely needed. For example, consider a KSP solver like CG, it would > run on the main CPU but each Vec is actually just a handle for the > true vector > entries that are in the GPU memory, a call to VecAXPY(), for example, > would > pass the scalar and the two Vec handles down to the GPU where the actual > axpy is performed. With this paradigm the only values passed back to the > main CPU are scalars. This is why I think this work has to be done only > on the latests GPU systems with lots of memory. > You are right! That is what I do for the SML application. I convert and copy the matrix to the GPU and then iteratively send a new vector to the GPU for multiplication. In fact unless the matrix will be reused at least 4 times, there would be no performance gain! The 8800 GTX has 768 MB of memory but you can have multiple GPUs running on your machine and split your data amongst them to effectively get more memory. What I was thinking of was to add GPU pointers to a PETSC Mat or Vec object with an optional shadow parameter. If shadow is enabled the host will keep a copy of what is on the GPU in main memory. That way if there are changes to the original matrix: 1. It might be less expensive to make the modification on the host. 2. It might be possible to update the matrix on the GPU by sending diffs, as the copying of data back and forth is the most expensive operation. It also will allow the use of that matrix on both the host and the GPU to offer maximum flexibility. Matrix assembly would be done on the host anyway. (I am not sure if the above points are loads of rubbish or not. But I think that there are many options to be considered in a PETSc GPU implementation.) A question that I had regarding the PETSc code when I was thinking about this was: You have the SeqAIJ matrix type and the the MPIAIJ type built around it (or that is what I understand from the code). So basically you implement the SeqAIJ type for the GPU and you get the MPI type for free? Ahmed > Barry > > > > > > > While this would be outside the scope of my MPhil I would be very > > interested in helping to add GPU support for PETSC. I have not yet had > > any experience with CTM for programming ATI GPUs but I believe there > > would not be a huge difference. > > > > I have access to a GeForce 8800GTX GPU (single precession only) at the > > ANU. I have been talking with my supervisor about getting a GTX280 or > > GTX 260 (supports double precision) but I don't know if we will be > > getting one. > > > > Anyway I would like to help. So if anyone would like to start thinking > > how this would be best implemented, I am available. :) > > > > Ahmed > > > > On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote: > >> Hi, > >> > >> do you know whether there are any efforts to run PETSc on GPUs > >> (graphical processing units)? > >> > >> Thanks, > >> Harald > >> > >> > >> > >
PETSc on GPUs
Harald, I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML applications (just about finished actually). As the SML application I am planing to base my work on uses PETSc, I have written some functions that will convert AIJ matrices and Vectors to SP and copy them to the GPU, multiply them and copy them back to the host. I would be happy to share them with you if you want. While this would be outside the scope of my MPhil I would be very interested in helping to add GPU support for PETSC. I have not yet had any experience with CTM for programming ATI GPUs but I believe there would not be a huge difference. I have access to a GeForce 8800GTX GPU (single precession only) at the ANU. I have been talking with my supervisor about getting a GTX280 or GTX 260 (supports double precision) but I don't know if we will be getting one. Anyway I would like to help. So if anyone would like to start thinking how this would be best implemented, I am available. :) Ahmed On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote: > Hi, > > do you know whether there are any efforts to run PETSc on GPUs > (graphical processing units)? > > Thanks, > Harald > > >
PETSc on GPUs
On Wed, Sep 17, 2008 at 10:51 PM, Ahmed El Zein wrote: > On Wed, 2008-09-17 at 21:33 -0500, Barry Smith wrote: >> Ahmed, >> >> This is very cool. >> On Sep 17, 2008, at 8:05 PM, Ahmed El Zein wrote: >> >> > Harald, >> > I am working on implementing SpMV on a NVIDIA GPU with CUDA for SML >> > applications (just about finished actually). >> > >> > As the SML application I am planing to base my work on uses PETSc, I >> > have written some functions that will convert AIJ matrices and Vectors >> > to SP and copy them to the GPU, multiply them and copy them back to >> > the >> > host. I would be happy to share them with you if you want. >> >> I think to really make the GPU truly a large step forward in >> performance, the Vec's need to >> be kept on the GPU and only transported back to the main CPU when >> absolutely needed. For example, consider a KSP solver like CG, it would >> run on the main CPU but each Vec is actually just a handle for the >> true vector >> entries that are in the GPU memory, a call to VecAXPY(), for example, >> would >> pass the scalar and the two Vec handles down to the GPU where the actual >> axpy is performed. With this paradigm the only values passed back to the >> main CPU are scalars. This is why I think this work has to be done only >> on the latests GPU systems with lots of memory. >> > You are right! That is what I do for the SML application. I convert and > copy the matrix to the GPU and then iteratively send a new vector to the > GPU for multiplication. In fact unless the matrix will be reused at > least 4 times, there would be no performance gain! > > The 8800 GTX has 768 MB of memory but you can have multiple GPUs running > on your machine and split your data amongst them to effectively get more > memory. > > What I was thinking of was to add GPU pointers to a PETSC Mat or Vec > object with an optional shadow parameter. If shadow is enabled the host > will keep a copy of what is on the GPU in main memory. That way if there > are changes to the original matrix: > 1. It might be less expensive to make the modification on the host. > 2. It might be possible to update the matrix on the GPU by sending > diffs, as the copying of data back and forth is the most expensive > operation. > > It also will allow the use of that matrix on both the host and the GPU > to offer maximum flexibility. Matrix assembly would be done on the host > anyway. > > (I am not sure if the above points are loads of rubbish or not. But I > think that there are many options to be considered in a PETSc GPU > implementation.) > > > A question that I had regarding the PETSc code when I was thinking about > this was: > You have the SeqAIJ matrix type and the the MPIAIJ type built around it > (or that is what I understand from the code). So basically you implement > the SeqAIJ type for the GPU and you get the MPI type for free? Yes, that is true. However, note that in the MPI step, you will need a gather operation to get the second matrix multiply to work. Matt > Ahmed > >> Barry >> >> > >> > >> > While this would be outside the scope of my MPhil I would be very >> > interested in helping to add GPU support for PETSC. I have not yet had >> > any experience with CTM for programming ATI GPUs but I believe there >> > would not be a huge difference. >> > >> > I have access to a GeForce 8800GTX GPU (single precession only) at the >> > ANU. I have been talking with my supervisor about getting a GTX280 or >> > GTX 260 (supports double precision) but I don't know if we will be >> > getting one. >> > >> > Anyway I would like to help. So if anyone would like to start thinking >> > how this would be best implemented, I am available. :) >> > >> > Ahmed >> > >> > On Wed, 2008-09-17 at 08:24 -0700, Harald Pfeiffer wrote: >> >> Hi, >> >> >> >> do you know whether there are any efforts to run PETSc on GPUs >> >> (graphical processing units)? >> >> >> >> Thanks, >> >> Harald >> >> >> >> >> >> >> > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener