Karl Rupp writes:
> Hi,
>
>>> If the context and queue are not attached to objects, then they would
>>> essentially represent global state, which is something I want to avoid.
>>
>> I was thinking that the context returned would be specific to the Mat
>> and the device it was about to run on.
>
>
Hi,
Fair enough. Is a brute-force implementation for P1 elements
sufficient as a baseline for discussion?
src/ksp/ksp/examples/tutorials/ex4.c
Ok, thanks, that's the COO part of the comparison then. I'll need to
provide the CSR-like case then. :-)
Best regards,
Karli
Hi Matt,
Here I believe strongly that we need tests. Nathan assured me that
nothing is faster on the GPU than sort+reduce-by-key since
they are highly optimized. I think they will be hard to beat, and the
initial timings I had say that this is the case. I am willing to be
wrong, but I am not wil
Hi Dominic,
I think you were referring to the 'Mat' on the device, while I was
referring to the plain PETSc Mat. The difficulty for a 'Mat' on the
device is a limitation of OpenCL in defining opaque types: It is not
possible to have something like
typedef struct OpenCLMat {
__global int row_
On Tue, Sep 24, 2013 at 8:11 AM, Karl Rupp wrote:
> Hi Matt,
>
>
>> Here I believe strongly that we need tests. Nathan assured me that
>> nothing is faster on the GPU than sort+reduce-by-key since
>> they are highly optimized. I think they will be hard to beat, and the
>> initial timings I had sa
Hi,
If the context and queue are not attached to objects, then they would
essentially represent global state, which is something I want to avoid.
I was thinking that the context returned would be specific to the Mat
and the device it was about to run on.
Users who want to do the assembly rig
Hi,
I think you were referring to the 'Mat' on the device, while I was
referring to the plain PETSc Mat. The difficulty for a 'Mat' on the
device is a limitation of OpenCL in defining opaque types: It is not
possible to have something like
typedef struct OpenCLMat {
__global int row_indic
Karl Rupp writes:
>> Okay, but why do they need to provide their own "Mat" data?
>
> If the context and queue are not attached to objects, then they would
> essentially represent global state, which is something I want to avoid.
I was thinking that the context returned would be specific to the Mat
and the device it was about to run on.
On Tue, Sep 24, 2013 at 7:07 AM, Karl Rupp wrote:
> Hey,
>
>
> On 09/24/2013 03:53 PM, Jed Brown wrote:
>
>> Karl Rupp writes:
>>
>>> I'm not talking about CSR vs. COO from the SpMV point of view, but
>>> rather on how to store the actual data in global memory without
>>> expensive subsequent sorts.
Hey,
On 09/24/2013 03:53 PM, Jed Brown wrote:
Karl Rupp writes:
I'm not talking about CSR vs. COO from the SpMV point of view, but
rather on how to store the actual data in global memory without
expensive subsequent sorts.
Sure, but this seems like such a minor detail. With PetscScalar=double
and PetscInt=int, we have 16 bytes/entry
Hi,
Perhaps that *GetSource method should also return an opaque device "Mat"
pointer that the user is responsible for shepherding into the kernel
from which they call the device MatSetValues?
This is easy if the OpenCL management is within PETSc (i.e. context,
buffers and command queues mana
Karl Rupp writes:
> I'm not talking about CSR vs. COO from the SpMV point of view, but
> rather on how to store the actual data in global memory without
> expensive subsequent sorts.
Sure, but this seems like such a minor detail. With PetscScalar=double
and PetscInt=int, we have 16 bytes/entry
Hey,
My primary metric for GPU kernels is memory transfers from global memory
('flops are free'), hence what I suggest for the assembly stage is to go
with something CSR-like rather than COO. Pure CSR may be too expensive
in terms of element lookup if there are several fields involved
(partic
Karl Rupp writes:
> Hey,
>
>> Perhaps that *GetSource method should also return an opaque device "Mat"
>> pointer that the user is responsible for shepherding into the kernel
>> from which they call the device MatSetValues?
>
> This is easy if the OpenCL management is within PETSc (i.e. context,
On Tue, Sep 24, 2013 at 2:45 AM, Jed Brown wrote:
> Karl Rupp writes:
>
> >>> This can obviously be done incrementally, so storing a batch of
> >>> element matrices to global memory is not a problem.
> >>
> >> If you store element matrices to global memory, you're using a ton of
> >> bandwidth (about 20x the size of the matrix if using P1 tets).
Karl Rupp writes:
>>> This can obviously be done incrementally, so storing a batch of
>>> element matrices to global memory is not a problem.
>>
>> If you store element matrices to global memory, you're using a ton of
>> bandwidth (about 20x the size of the matrix if using P1 tets).
>>
>> What if you do the sort/reduce thing within thread blocks, and only
>> write the reduced version to global storage?
Hey,
Perhaps that *GetSource method should also return an opaque device "Mat"
pointer that the user is responsible for shepherding into the kernel
from which they call the device MatSetValues?
This is easy if the OpenCL management is within PETSc (i.e. context,
buffers and command queues man
This can obviously be done incrementally, so storing a batch of
element matrices to global memory is not a problem.
If you store element matrices to global memory, you're using a ton of
bandwidth (about 20x the size of the matrix if using P1 tets).
What if you do the sort/reduce thing within thread blocks, and only
write the reduced version to global storage?
Matthew Knepley writes:
>> These users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES.
>
>
> Fine, we should have a flag like that.
We do, it's called MAT_NO_OFF_PROC_ENTRIES.
>> What if you do the sort/reduce thing within thread blocks, and only
>> write the reduced version to global storage?
On Mon, Sep 23, 2013 at 2:46 PM, Jed Brown wrote:
> Matthew Knepley writes:
>
> > Okay, here is how I understand GPU matrix assembly. The only way it
> > makes sense to me is in COO format which you may later convert. In
> > mpiaijAssemble.cu I have code that
> >
> > - Produces COO rows
> >
Karl Rupp writes:
> a)
> I think this needs a second thought on how we manage the raw OpenCL
> buffers. My suggestion last year was that we 'wrap' pointers to raw
> memory buffers into something like
> struct generic_ptr {
> void * cpu_ptr;
> void * cuda_ptr;
> cl_mem opencl_ptr;
Matthew Knepley writes:
> Okay, here is how I understand GPU matrix assembly. The only way it
> makes sense to me is in COO format which you may later convert. In
> mpiaijAssemble.cu I have code that
>
> - Produces COO rows
> - Segregates them into on and off-process rows
These users compute redundantly and set MAT_NO_OFF_PROC_ENTRIES.
Hi Jed,
> We have some motivated users that would like a way to assemble matrices
on a device, without needing to store all the element matrices to global
memory or to transfer them to the CPU. Given GPU execution models, this
means we need something that can be done on-the-spot in kernels. So
On Mon, Sep 23, 2013 at 12:30 PM, Jed Brown wrote:
> We have some motivated users that would like a way to assemble matrices
> on a device, without needing to store all the element matrices to global
> memory or to transfer them to the CPU. Given GPU execution models, this
> means we need someth
We have some motivated users that would like a way to assemble matrices
on a device, without needing to store all the element matrices to global
memory or to transfer them to the CPU. Given GPU execution models, this
means we need something that can be done on-the-spot in kernels. So
what about a