Hey,

>> My primary metric for GPU kernels is memory transfers from global memory
>> ('flops are free'), hence what I suggest for the assembly stage is to go
>> with something CSR-like rather than COO. Pure CSR may be too expensive
>> in terms of element lookup if there are several fields involved
>> (particularly 3d), so one could push (column-index, value) pairs for
>> each row and make the merge-by-key much cheaper than for arbitrary COO
>> matrices.

> I think CSR vs. COO is a second-order optimization to be considered
> after the 20x redundancy has been eliminated and a synchronization
> strategy has been chosen (e.g., coloring vs redundant storage and later
> compression).

I'm not talking about CSR vs. COO from the SpMV point of view, but
rather about how to store the actual data in global memory without
expensive subsequent sorts.
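
To make the storage idea concrete, here is a minimal sketch in plain Python (for readability only, not a GPU kernel). It assumes each element contributes a small dense local matrix and a list of global indices; the function names `push_row_pairs` and `merge_rows` are hypothetical. The point is that the merge-by-key only ever touches one short row at a time, instead of sorting the whole COO triplet list.

```python
def push_row_pairs(n_rows, elements):
    """Append (column-index, value) pairs to the row they belong to.

    elements: iterable of (dofs, local_matrix), where dofs are global
    row/column indices and local_matrix is a dense list of lists.
    """
    rows = [[] for _ in range(n_rows)]
    for dofs, local in elements:
        for a, i in enumerate(dofs):
            for b, j in enumerate(dofs):
                rows[i].append((j, local[a][b]))
    return rows


def merge_rows(rows):
    """Merge duplicate columns row by row and emit CSR arrays."""
    row_ptr, col_idx, values = [0], [], []
    for pairs in rows:
        merged = {}
        for j, v in pairs:
            merged[j] = merged.get(j, 0.0) + v  # sum duplicates by key
        for j in sorted(merged):                # short, per-row sort only
            col_idx.append(j)
            values.append(merged[j])
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values


# Two 1D linear elements on three nodes (illustrative data).
local = [[1.0, -1.0], [-1.0, 1.0]]
elements = [([0, 1], local), ([1, 2], local)]
row_ptr, col_idx, values = merge_rows(push_row_pairs(3, elements))
```

Since every pair is already in its row, the work per row is bounded by the number of elements touching that row, which is small for typical meshes.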


>> This, of course, requires the knowledge of the nonzero pattern and
>> couplings among elements, yet this is reasonably cheap to extract for a
>> large number of problems (for example, (non)linear PDEs without
>> adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
>> uses coloring for avoiding expensive atomic writes to global memory.
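
As an illustration of the coloring idea, here is a minimal greedy sketch in plain Python. It assumes the element-to-unknown connectivity is available as a list of index lists; `color_elements` is a hypothetical name, and greedy coloring is just one simple choice, not necessarily what one would use in production. Elements of the same color share no unknowns, so they can write their contributions concurrently without atomics.

```python
def color_elements(element_dofs):
    """Greedy coloring: elements sharing an unknown get different colors."""
    dof_colors = {}  # unknown -> colors already used by elements touching it
    colors = []
    for dofs in element_dofs:
        used = set()
        for d in dofs:
            used |= dof_colors.get(d, set())
        c = 0
        while c in used:  # smallest color not used by any neighbor
            c += 1
        colors.append(c)
        for d in dofs:
            dof_colors.setdefault(d, set()).add(c)
    return colors


# Chain of four 1D elements on five nodes (illustrative data).
element_dofs = [[0, 1], [1, 2], [2, 3], [3, 4]]
colors = color_elements(element_dofs)
```

The connectivity traversed here is the same information that determines the nonzero pattern, so both can be extracted in one pass.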

> At this point, I don't mind having the nonzero pattern set ahead of time
> using CPU code.  It's reassembly in time-dependent problems with no
> adaptivity or occasional adaptivity that I'm more concerned with.

Okay, this makes things a lot easier :-)
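
With the pattern fixed ahead of time, reassembly can reduce to a pure scatter-add into the CSR value array. The following is a rough sketch in plain Python under that assumption; `build_pattern`, `build_scatter_map` and `reassemble` are hypothetical names, and a real GPU version would combine the scatter with coloring or atomics to handle concurrent writes.

```python
def build_pattern(n_rows, element_dofs):
    """One-time setup (e.g. on the CPU): CSR pattern from connectivity."""
    row_cols = [set() for _ in range(n_rows)]
    for dofs in element_dofs:
        for i in dofs:
            row_cols[i].update(dofs)
    row_ptr, col_idx = [0], []
    for cols in row_cols:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def build_scatter_map(row_ptr, col_idx, element_dofs):
    """One-time setup: target position in the CSR value array for every
    local matrix entry, stored in row-major order per element."""
    pos = {}
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            pos[(i, col_idx[k])] = k
    return [[pos[(i, j)] for i in dofs for j in dofs] for dofs in element_dofs]


def reassemble(values, scatter_map, local_matrices):
    """Per time step: zero the values, then scatter-add. No lookup, no sort."""
    for k in range(len(values)):
        values[k] = 0.0
    for targets, local in zip(scatter_map, local_matrices):
        flat = [v for row in local for v in row]  # row-major, matches map
        for k, v in zip(targets, flat):
            values[k] += v
    return values


# Illustrative data: two 1D elements, reassembled with new local matrices.
element_dofs = [[0, 1], [1, 2]]
row_ptr, col_idx = build_pattern(3, element_dofs)
scatter_map = build_scatter_map(row_ptr, col_idx, element_dofs)
values = [0.0] * len(col_idx)
local = [[1.0, -1.0], [-1.0, 1.0]]
reassemble(values, scatter_map, [local, local])
```

Only `reassemble` runs in each time step; the pattern and the scatter map are rebuilt only when the mesh changes.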

Best regards,
Karli
