Hi Paul,
> Not too heavy. I've already converted much of this code to remove this
> package while supporting existing features, though I haven't pushed it
> into the fork. The real question is whether we want to go down this
> path or not.
I see two options: either txpetscgpu is a self-contained package that
brings its own set of implementation files along, or it gets integrated.
The current model of injected PETSC_HAVE_TXPETSCGPU preprocessor
switches will not be able to compete in any code beauty contest... ;-)
Either way, there is presumably also a licensing question involved, so
you guys need to agree to have txpetscgpu integrated (or not).
> Right now, I think CUSP does not support SpMVs in streams. Thus, in
> order to get an effective multi-GPU SpMV, one has to rewrite all the
> SpMV kernels (for all the different storage formats) to use streams.
> This adds a lot of additional code to support. I would prefer to just
> call some CUSP API with a stream as an input argument, but I don't
> think that exists at the moment. I'm not sure what to do here. Once
> the other code is accepted, perhaps we can address this problem then?
The CUSP API needs to provide streams for that, yes.
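For reference, CUSPARSE already provides this kind of entry point: one
binds a CUDA stream to the CUSPARSE handle, and subsequent SpMV calls
are launched in that stream. A rough, untested sketch (the wrapper and
its parameter names are mine, just for illustration):

  /* Sketch only: bind a stream to the CUSPARSE handle, then run
   * y = A*x asynchronously in that stream for a CSR matrix. */
  #include <cuda_runtime.h>
  #include <cusparse_v2.h>

  static void csr_spmv_on_stream(cusparseHandle_t handle, cudaStream_t stream,
                                 int m, int n, int nnz,
                                 const double *val, const int *rowptr,
                                 const int *colind,
                                 const double *x, double *y)
  {
    const double one = 1.0, zero = 0.0;
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);    /* defaults: general matrix, 0-based indices */
    cusparseSetStream(handle, stream); /* subsequent calls launch in 'stream' */
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &one, descr,
                   val, rowptr, colind, x, &zero, y);
    cusparseDestroyMatDescr(descr);
  }

If CUSP exposed a comparable stream argument, the per-format kernel
rewrites you mention would become unnecessary.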
As I noted in my comments on your commits on Bitbucket, I'd prefer to
see CUSP separated from CUSPARSE and to use a CUSPARSE-native matrix
data structure (a simple collection of handles) instead. This way one
can already use the CUSPARSE interface if only the CUDA SDK is
installed, and hook in CUSP later for preconditioners, etc.
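To illustrate what I have in mind (the struct and its members are made
up for this sketch, not actual PETSc types), such a CUSPARSE-native
container only needs to bundle the handles with the raw CSR arrays on
the device:

  #include <cusparse_v2.h>

  /* Hypothetical CUSPARSE-native matrix container: library handle,
   * matrix descriptor, and the CSR arrays living on the device.
   * No CUSP types involved, so the CUDA SDK alone is sufficient. */
  typedef struct {
    cusparseHandle_t   handle;  /* CUSPARSE context */
    cusparseMatDescr_t descr;   /* matrix type, index base, ... */
    int     m, n, nnz;          /* dimensions and number of nonzeros */
    int    *rowptr;             /* device pointer, length m+1 */
    int    *colind;             /* device pointer, length nnz */
    double *val;                /* device pointer, length nnz */
  } CsrMatCUSPARSE;

CUSP-based preconditioners could then be wrapped around these raw
arrays later on, without CUSPARSE depending on CUSP.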
> It works across nodes, but you have to know what you're doing. This is
> a tough problem to solve universally because it's (almost) impossible
> to determine the number of MPI ranks per node in an MPI run; I've
> never seen an MPI function that returns this information.
> Right now, a 1-1 pairing between CPU core and GPU will work across any
> system with any number of nodes. I've tested this on a system with 2
> nodes and 4 GPUs per node (so "mpirun -n 8 -npernode 4" would work).
Thanks, I see. Apparently I'm not the only one struggling with this
abstraction issue...
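For the record, my understanding is that the pairing you describe boils
down to something like the following sketch, which assumes that ranks
are placed node by node (as with -npernode):

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Sketch of a 1-1 rank-to-GPU pairing: each rank on a node picks a
   * distinct device, assuming ranks are filled node by node. */
  int main(int argc, char **argv)
  {
    int rank, ndev;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);     /* GPUs visible on this node */
    cudaSetDevice(rank % ndev);    /* e.g. 4 ranks per node, 4 GPUs per node */
    /* ... set up CUSPARSE/CUSP objects on the selected device ... */
    MPI_Finalize();
    return 0;
  }

With more ranks per node than GPUs, or a different rank placement, this
simple modulo mapping no longer gives a clean pairing, which is exactly
the part that is hard to abstract away.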
Best regards,
Karli