Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Jed Brown
Karl Rupp writes:
> Hi,
>
>>> If the context and queue are not attached to objects, then they would
>>> essentially represent global state, which is something I want to avoid.
>>
>> I was thinking that the context returned would be specific to the Mat
>> and the device it was about to run on.
>

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hi, Fair enough. Is a brute-force implementation for P1 elements sufficient as a baseline for discussion? src/ksp/ksp/examples/tutorials/ex4.c Ok, thanks, that's the COO part of the comparison then. I'll need to provide the CSR-like case then. :-) Best regards, Karli

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hi Matt, Here I believe strongly that we need tests. Nathan assured me that nothing is faster on the GPU than sort+reduce-by-key since they are highly optimized. I think they will be hard to beat, and the initial timings I had say that this is the case. I am willing to be wrong, but I am not wil
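
A minimal sequential C reference for the sort + reduce-by-key step under discussion may help pin down its semantics: COO triplets are sorted by (row, col) and entries with equal coordinates are summed. On the GPU this would be done with optimized library primitives; the sketch below is illustrative only and none of the names come from PETSc.

#include <stdlib.h>

/* One COO entry (row, col, value). */
typedef struct { int row, col; double val; } Triplet;

static int cmp_triplet(const void *a, const void *b)
{
  const Triplet *x = (const Triplet *)a, *y = (const Triplet *)b;
  if (x->row != y->row) return (x->row < y->row) ? -1 : 1;
  if (x->col != y->col) return (x->col < y->col) ? -1 : 1;
  return 0;
}

/* Sort by (row, col) and sum duplicates in place; returns the number of
 * unique entries. Sequential reference for the GPU sort + reduce-by-key. */
static size_t coo_sort_reduce(Triplet *t, size_t n)
{
  size_t i, m = 0;
  if (!n) return 0;
  qsort(t, n, sizeof(Triplet), cmp_triplet);
  for (i = 1; i < n; i++) {
    if (t[i].row == t[m].row && t[i].col == t[m].col) t[m].val += t[i].val;
    else t[++m] = t[i];
  }
  return m + 1;
}

The reduced list of unique (row, col, value) entries is what would then be converted to CSR or handed to MatSetValues.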

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hi Dominic, I think you were referring to the 'Mat' on the device, while I was referring to the plain PETSc Mat. The difficulty for a 'Mat' on the device is a limitation of OpenCL in defining opaque types: It is not possible to have something like typedef struct OpenCLMat { __global int row_
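
To make the limitation concrete, here is a hedged OpenCL C sketch: a struct holding __global pointers is usable inside a kernel, but the host has no portable way to fill such a struct, since OpenCL buffers are opaque cl_mem handles rather than raw device addresses (at least without the shared-virtual-memory features of later OpenCL versions). The col_indices/values members and the kernel are illustrative additions, not quoted from the thread.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* needed on OpenCL 1.x devices */

/* The struct one would like to hand to a kernel as a single opaque argument;
 * the host cannot populate the __global pointers directly. */
typedef struct OpenCLMat {
  __global int    *row_indices;
  __global int    *col_indices;   /* assumed member, not in the quoted message */
  __global double *values;        /* assumed member, not in the quoted message */
} OpenCLMat;

/* Common workaround: pass each buffer as a separate kernel argument and
 * rebuild the struct inside the kernel, where __global pointers are valid. */
__kernel void assemble(__global int    *row_indices,
                       __global int    *col_indices,
                       __global double *values)
{
  OpenCLMat mat;
  mat.row_indices = row_indices;
  mat.col_indices = col_indices;
  mat.values      = values;
  /* ... element assembly using mat ... */
}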

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Matthew Knepley
On Tue, Sep 24, 2013 at 8:11 AM, Karl Rupp wrote:
> Hi Matt,
>
>
> Here I believe strongly that we need tests. Nathan assured me that
>> nothing is faster on the GPU than sort+reduce-by-key since
>> they are highly optimized. I think they will be hard to beat, and the
>> initial timings I had sa

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hi, If the context and queue are not attached to objects, then they would essentially represent global state, which is something I want to avoid. I was thinking that the context returned would be specific to the Mat and the device it was about to run on. Users who want to do the assembly rig
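
As a rough sketch of the "attached to the object" variant being discussed here, the user would query the context and queue from the Mat itself and use them to launch an assembly kernel. MatOpenCLGetContext and MatOpenCLGetCommandQueue are placeholder names, not existing PETSc functions; error handling on the OpenCL calls is omitted.

#include <CL/cl.h>
#include <petscmat.h>

/* Hypothetical query functions, standing in for whatever the final API
 * provides; they are not part of PETSc. */
PETSC_EXTERN PetscErrorCode MatOpenCLGetContext(Mat, cl_context*);
PETSC_EXTERN PetscErrorCode MatOpenCLGetCommandQueue(Mat, cl_command_queue*);

PetscErrorCode UserAssemble(Mat A, cl_kernel assembly_kernel)
{
  cl_context       ctx;
  cl_command_queue queue;
  size_t           gsize = 1024;  /* illustrative global work size */
  PetscErrorCode   ierr;

  PetscFunctionBegin;
  ierr = MatOpenCLGetContext(A, &ctx);CHKERRQ(ierr);        /* hypothetical */
  ierr = MatOpenCLGetCommandQueue(A, &queue);CHKERRQ(ierr); /* hypothetical */
  /* ctx would be used to create any scratch buffers; unused in this sketch. */
  (void)ctx;
  /* ... set kernel arguments from buffers associated with A ... */
  clEnqueueNDRangeKernel(queue, assembly_kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
  clFinish(queue);
  PetscFunctionReturn(0);
}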

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Dominic Meiser
Hi, I think you were referring to the 'Mat' on the device, while I was referring to the plain PETSc Mat. The difficulty for a 'Mat' on the device is a limitation of OpenCL in defining opaque types: It is not possible to have something like typedef struct OpenCLMat { __global int row_indic

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Jed Brown
Karl Rupp writes:
>> Okay, but why do they need to provide their own "Mat" data?
>
> If the context and queue are not attached to objects, then they would
> essentially represent global state, which is something I want to avoid.

I was thinking that the context returned would be specific to the

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Matthew Knepley
On Tue, Sep 24, 2013 at 7:07 AM, Karl Rupp wrote:
> Hey,
>
>
> On 09/24/2013 03:53 PM, Jed Brown wrote:
>
>> Karl Rupp writes:
>>
>>> I'm not talking about CSR vs. COO from the SpMV point of view, but
>>> rather on how to store the actual data in global memory without
>>> expensive subsequent so

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hey, On 09/24/2013 03:53 PM, Jed Brown wrote: Karl Rupp writes: I'm not talking about CSR vs. COO from the SpMV point of view, but rather on how to store the actual data in global memory without expensive subsequent sorts. Sure, but this seems like such a minor detail. With PetscScalar=doub

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hi, Perhaps that *GetSource method should also return an opaque device "Mat" pointer that the user is responsible for shepherding into the kernel, from which they call the device MatSetValues? This is easy if the OpenCL management is within PETSc (i.e. context, buffers and command queues mana

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Jed Brown
Karl Rupp writes:
> I'm not talking about CSR vs. COO from the SpMV point of view, but
> rather on how to store the actual data in global memory without
> expensive subsequent sorts.

Sure, but this seems like such a minor detail. With PetscScalar=double and PetscInt=int, we have 16 bytes/entry
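
For reference, the 16 bytes/entry figure is just the size of one COO triplet under the default types: two 4-byte integers for the row and column indices plus one 8-byte scalar. A trivial check, assuming 32-bit PetscInt and double-precision PetscScalar:

#include <stdio.h>

typedef int    PetscInt;     /* default 32-bit indices */
typedef double PetscScalar;  /* default real scalar */

int main(void)
{
  /* One COO entry: row index + column index + value = 4 + 4 + 8 bytes. */
  printf("%zu bytes/entry\n", 2 * sizeof(PetscInt) + sizeof(PetscScalar));
  return 0;
}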

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hey,

>> My primary metric for GPU kernels is memory transfers from global memory ('flops are free'), hence what I suggest for the assembly stage is to go with something CSR-like rather than COO. Pure CSR may be too expensive in terms of element lookup if there are several fields involved (partic

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Jed Brown
Karl Rupp writes:
> Hey,
>
>> Perhaps that *GetSource method should also return an opaque device "Mat"
>> pointer that the user is responsible for shepherding into the kernel,
>> from which they call the device MatSetValues?
>
> This is easy if the OpenCL management is within PETSc (i.e. context,

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Matthew Knepley
On Tue, Sep 24, 2013 at 2:45 AM, Jed Brown wrote:
> Karl Rupp writes:
>
> >>> This can obviously be done incrementally, so storing a batch of
> >>> element matrices to global memory is not a problem.
> >>
> >> If you store element matrices to global memory, you're using a ton of
> >> bandwidth (

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Jed Brown
Karl Rupp writes:
>>> This can obviously be done incrementally, so storing a batch of
>>> element matrices to global memory is not a problem.
>>
>> If you store element matrices to global memory, you're using a ton of
>> bandwidth (about 20x the size of the matrix if using P1 tets).
>>
>> What if

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
Hey, Perhaps that *GetSource method should also return an opaque device "Mat" pointer that the user is responsible for shepherding into the kernel, from which they call the device MatSetValues? This is easy if the OpenCL management is within PETSc (i.e. context, buffers and command queues man
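
As an illustration of what a device-side MatSetValues might look like for a preallocated CSR matrix: find the column inside the row's index range and accumulate. All names here are hypothetical, and a real version would need an atomic update (or a coloring scheme) whenever several work-items can hit the same entry.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* needed on OpenCL 1.x devices */

/* Hypothetical device-side matrix handle, rebuilt from separate kernel
 * arguments as in the OpenCLMat workaround sketched earlier. */
typedef struct DeviceMat {
  __global const int *row_ptr;  /* CSR row offsets, length nrows+1 */
  __global const int *col_idx;  /* CSR column indices */
  __global double    *values;   /* CSR values */
} DeviceMat;

/* Hypothetical MatSetValue analogue: locate the column within the
 * preallocated row and add the contribution. */
void MatSetValueDevice(DeviceMat *A, int row, int col, double v)
{
  for (int k = A->row_ptr[row]; k < A->row_ptr[row + 1]; ++k) {
    if (A->col_idx[k] == col) { A->values[k] += v; break; }
  }
}

__kernel void assemble_elements(__global const int *row_ptr,
                                __global const int *col_idx,
                                __global double    *values)
{
  DeviceMat A;
  A.row_ptr = row_ptr;
  A.col_idx = col_idx;
  A.values  = values;
  /* ... compute local element contributions, then for each (i, j, v):
         MatSetValueDevice(&A, i, j, v); ... */
}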

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-24 Thread Karl Rupp
This can obviously be done incrementally, so storing a batch of element matrices to global memory is not a problem. If you store element matrices to global memory, you're using a ton of bandwidth (about 20x the size of the matrix if using P1 tets). What if you do the sort/reduce thing within

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Jed Brown
Matthew Knepley writes:
>> These users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES.
>
>
> Fine, we should have a flag like that.

We do, it's called MAT_NO_OFF_PROC_ENTRIES.

>> What if you do the sort/reduce thing within thread blocks, and only
>> write the reduced version to global stor
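
For context, MAT_NO_OFF_PROC_ENTRIES is set through the usual MatSetOption call; the wrapper routine below is only an illustration of where it fits in the assembly flow for users who compute redundantly and never generate off-process entries.

#include <petscmat.h>

/* Illustrative assembly routine: the option tells PETSc that this process
 * will not generate entries for rows owned by other processes, so
 * MatAssemblyBegin/End can skip the off-process communication. */
PetscErrorCode AssembleLocalRowsOnly(Mat A)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatSetOption(A, MAT_NO_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
  /* ... MatSetValues calls touching only locally owned rows ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}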

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Matthew Knepley
On Mon, Sep 23, 2013 at 2:46 PM, Jed Brown wrote:
> Matthew Knepley writes:
>
> > Okay, here is how I understand GPU matrix assembly. The only way it
> > makes sense to me is in COO format which you may later convert. In
> > mpiaijAssemble.cu I have code that
> >
> > - Produces COO rows
> >

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Jed Brown
Karl Rupp writes:
> a)
> I think this needs a second thought on how we manage the raw OpenCL
> buffers. My suggestion last year was that we 'wrap' pointers to raw
> memory buffers into something like
> struct generic_ptr {
>   void * cpu_ptr;
>   void * cuda_ptr;
>   cl_mem opencl_ptr;
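
Filling in the quoted fragment, the wrapper might look like the following; only cpu_ptr, cuda_ptr, and opencl_ptr appear in the message, and any further bookkeeping members (for instance a flag recording which copy is current) are assumptions.

#include <CL/cl.h>

/* Unified handle wrapping the per-backend representations of one buffer.
 * Which copy is up to date would have to be tracked elsewhere, or by an
 * additional member not shown in the quoted message. */
typedef struct generic_ptr {
  void   *cpu_ptr;     /* host allocation */
  void   *cuda_ptr;    /* device pointer from cudaMalloc */
  cl_mem  opencl_ptr;  /* OpenCL buffer object */
} generic_ptr;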

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Jed Brown
Matthew Knepley writes:
> Okay, here is how I understand GPU matrix assembly. The only way it
> makes sense to me is in COO format which you may later convert. In
> mpiaijAssemble.cu I have code that
>
> - Produces COO rows
> - Segregates them into on and off-process rows

These users compute
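
A sequential sketch of the "segregates them into on and off-process rows" step, using the locally owned row range [rstart, rend) as PETSc reports it; the Triplet type matches the sort/reduce sketch earlier, the function name is illustrative, and on the GPU this would be a parallel partition/scan rather than the in-place swap loop shown here.

#include <stddef.h>

typedef struct { int row, col; double val; } Triplet;  /* COO entry */

/* Partition COO entries in place: entries whose row lies in the locally
 * owned range [rstart, rend) end up at the front, off-process entries at
 * the back. Returns the number of local entries. */
size_t SegregateByOwnership(Triplet *t, size_t n, int rstart, int rend)
{
  size_t i = 0, j = n;
  while (i < j) {
    if (t[i].row >= rstart && t[i].row < rend) {
      i++;
    } else {
      Triplet tmp = t[i];
      t[i] = t[--j];
      t[j] = tmp;
    }
  }
  return i;
}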

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Karl Rupp
Hi Jed,

> We have some motivated users that would like a way to assemble matrices on a device, without needing to store all the element matrices to global memory or to transfer them to the CPU. Given GPU execution models, this means we need something that can be done on-the-spot in kernels. So

Re: [petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Matthew Knepley
On Mon, Sep 23, 2013 at 12:30 PM, Jed Brown wrote:
> We have some motivated users that would like a way to assemble matrices
> on a device, without needing to store all the element matrices to global
> memory or to transfer them to the CPU. Given GPU execution models, this
> means we need someth

[petsc-dev] Supporting OpenCL matrix assembly

2013-09-23 Thread Jed Brown
We have some motivated users that would like a way to assemble matrices on a device, without needing to store all the element matrices to global memory or to transfer them to the CPU. Given GPU execution models, this means we need something that can be done on-the-spot in kernels. So what about a