You are likely prematurely optimizing. Are you already solving systems
using the new solver on "real" problems? Does profiling indicate that the
duplicate memory usage or time to copy are limiting the problems you can run?
I highly recommend finishing and using your solver interface well
Karl Rupp writes:
> Hi,
>
>>> If the context and queue are not attached to objects, then they would
>>> essentially represent global state, which is something I want to avoid.
>>
>> I was thinking that the context returned would be specific to the Mat
>> and the device it was about to run on.
On Tue, Sep 24, 2013 at 9:07 AM, Jose David Bermeol wrote:
> Hi, I have some questions:
>
> 1) What is the main reason to split the matrix in each MPI process in
> diagonal matrix and off diagonal matrix?
>
To overlap communication with computation.
> 2) Is this just for MATMPIAIJ matrices?
>
Hi, I have some questions:
1) What is the main reason to split the matrix in each MPI process in diagonal
matrix and off diagonal matrix?
2) Is this just for MATMPIAIJ matrices?
3) Right now we are interested in saving as much memory as possible, so I guess
the right path would be to implement a
Matt,
The problem occurs with 1-4 cores. I have only observed the error when
using GAMG.
I do not get any error with:
-ksp_type bcgsl -lspaint_pc_type hypre -lspaint_pc_hypre_type boomeramg
or
-ksp_type bcgsl
I do get an error with
-ksp_type bcgsl -pc_type gamg -pc_gamg_threshold 0.01 -pc_ga
Hi,
Fair enough. Is a brute-force implementation for P1 elements
sufficient as a baseline for discussion?
src/ksp/ksp/examples/tutorials/ex4.c
Ok, thanks, that's the COO part of the comparison then. I'll need to
provide the CSR-like case then. :-)
Best regards,
Karli
Hi Matt,
Here I believe strongly that we need tests. Nathan assured me that
nothing is faster on the GPU than sort+reduce-by-key since
they are highly optimized. I think they will be hard to beat, and the
initial timings I had say that this is the case. I am willing to be
wrong, but I am not wil
Hi Dominic,
I think you were referring to the 'Mat' on the device, while I was
referring to the plain PETSc Mat. The difficulty for a 'Mat' on the
device is a limitation of OpenCL in defining opaque types: It is not
possible to have something like
typedef struct OpenCLMat {
__global int row_
On Tue, Sep 24, 2013 at 8:11 AM, Karl Rupp wrote:
> Hi Matt,
>
>
> Here I believe strongly that we need tests. Nathan assured me that
>> nothing is faster on the GPU than sort+reduce-by-key since
>> they are highly optimized. I think they will be hard to beat, and the
>> initial timings I had sa
Hi,
If the context and queue are not attached to objects, then they would
essentially represent global state, which is something I want to avoid.
I was thinking that the context returned would be specific to the Mat
and the device it was about to run on.
Users who want to do the assembly rig
John Mousel writes:
> I cloned a fresh repo and built from scratch. I'm seeing the same error I
> previously reported with both intel and gnu compilers.
Can you reproduce with a PETSc example or create a test case so that we
can reproduce?
> Also, I get warnings in MatSOR_SeqAIJ_Inode during the build.
On Tue, Sep 24, 2013 at 7:49 AM, John Mousel wrote:
> I cloned a fresh repo and built from scratch. I'm seeing the same error I
> previously reported with both intel and gnu compilers. Also, I get warnings
> in MatSOR_SeqAIJ_Inode during the build.
>
Do you get the error in serial? Do you get er
I cloned a fresh repo and built from scratch. I'm seeing the same error I
previously reported with both intel and gnu compilers. Also, I get warnings
in MatSOR_SeqAIJ_Inode during the build.
src/mat/impls/aij/seq/inode.c: In function ‘MatSOR_SeqAIJ_Inode’:
src/mat/impls/aij/seq/inode.c:2758: warni
Karl Rupp writes:
>> Okay, but why do they need to provide their own "Mat" data?
>
> If the context and queue are not attached to objects, then they would
> essentially represent global state, which is something I want to avoid.
I was thinking that the context returned would be specific to the Mat
and the device it was about to run on.
On Tue, Sep 24, 2013 at 7:07 AM, Karl Rupp wrote:
> Hey,
>
>
> On 09/24/2013 03:53 PM, Jed Brown wrote:
>
>> Karl Rupp writes:
>>
>>> I'm not talking about CSR vs. COO from the SpMV point of view, but
>>> rather on how to store the actual data in global memory without
>>> expensive subsequent sorts.
Hey,
On 09/24/2013 03:53 PM, Jed Brown wrote:
Karl Rupp writes:
I'm not talking about CSR vs. COO from the SpMV point of view, but
rather on how to store the actual data in global memory without
expensive subsequent sorts.
Sure, but this seems like such a minor detail. With PetscScalar=double
and PetscInt=int, we have 16 bytes/entry
Hi,
Perhaps that *GetSource method should also return an opaque device "Mat"
pointer that the user is responsible for shepherding into the kernel
from which they call the device MatSetValues?
This is easy if the OpenCL management is within PETSc (i.e. context,
buffers and command queues mana
Karl Rupp writes:
> I'm not talking about CSR vs. COO from the SpMV point of view, but
> rather on how to store the actual data in global memory without
> expensive subsequent sorts.
Sure, but this seems like such a minor detail. With PetscScalar=double
and PetscInt=int, we have 16 bytes/entry
Hey,
>> My primary metric for GPU kernels is memory transfers from global memory
('flops are free'), hence what I suggest for the assembly stage is to go
with something CSR-like rather than COO. Pure CSR may be too expensive
in terms of element lookup if there are several fields involved
(partic
Karl Rupp writes:
> Hey,
>
>> Perhaps that *GetSource method should also return an opaque device "Mat"
>> pointer that the user is responsible for shepherding into the kernel
>> from which they call the device MatSetValues?
>
> This is easy if the OpenCL management is within PETSc (i.e. context,
On Tue, Sep 24, 2013 at 2:45 AM, Jed Brown wrote:
> Karl Rupp writes:
>
> >>> This can obviously be done incrementally, so storing a batch of
> >>> element matrices to global memory is not a problem.
> >>
> >> If you store element matrices to global memory, you're using a ton of
> >> bandwidth (
Karl Rupp writes:
>>> This can obviously be done incrementally, so storing a batch of
>>> element matrices to global memory is not a problem.
>>
>> If you store element matrices to global memory, you're using a ton of
>> bandwidth (about 20x the size of the matrix if using P1 tets).
>>
>> What if
This can obviously be done incrementally, so storing a batch of
element matrices to global memory is not a problem.
If you store element matrices to global memory, you're using a ton of
bandwidth (about 20x the size of the matrix if using P1 tets).
What if you do the sort/reduce thing within