Hey Karl,

Not too heavy. I've already converted much of this code to remove this package while supporting existing features, though I haven't pushed it into the fork. The real question is whether we want to go down this path or not.

Right now, I think CUSP does not support SpMV in streams. Thus, to get an effective multi-GPU SpMV, one has to rewrite all the SpMV kernels, for all the different storage formats, to use streams. That adds a lot of additional code to support. I would prefer to just call some CUSP API that takes a stream as an input argument, but I don't think that exists at the moment. I'm not sure what to do here. Once the other code is accepted, perhaps we can address this problem then?
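
To be concrete about what I mean, here is a rough sketch (illustration only, not code from my branch) using the cuSPARSE CSR SpMV, which does take a stream through its handle; the function and argument names are just placeholders:

#include <cuda_runtime.h>
#include <cusparse_v2.h>

/* Illustration only: run a CSR SpMV y = A*x in a caller-supplied stream.
 * cuSPARSE attaches the stream to its handle; CUSP has no comparable
 * entry point.  All names here are placeholders. */
void csr_spmv_in_stream(cusparseHandle_t handle, cudaStream_t stream,
                        int m, int n, int nnz,
                        cusparseMatDescr_t descr,
                        const double *val, const int *rowptr,
                        const int *colind,
                        const double *x, double *y)
{
  const double one = 1.0, zero = 0.0;

  /* Subsequent cuSPARSE calls on this handle execute in 'stream'. */
  cusparseSetStream(handle, stream);

  /* CSR SpMV: y = 1.0*A*x + 0.0*y */
  cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 m, n, nnz, &one, descr, val, rowptr, colind,
                 x, &zero, y);
}

An analogous hook in CUSP, for each of its storage formats, is what I would like to call instead of rewriting every kernel.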


> How 'heavy' is this dependency? Is there some 'blocker' which prevents a complete integration, or is it just not finished yet?


>> I would like to start a discussion on the changes I have made, which primarily affect aijcusparse, mpiaijcusparse, and veccusp. There are two commits that need to be reviewed:



> Alright, I'll comment there.


>> I think (1) should be reviewed first, as this adds most of the serial GPU capability to aijcusparse. The second commit (2) adds changes to veccusp and mpiaijcusparse to get an efficient multi-GPU SpMV.


It works across nodes, but you have to know what you're doing. This is a tough problem to solve universally because it's (almost) impossible to determine the number of MPI ranks per node in an MPI run. I've never seen an MPI function that returns this information.

Right now, a 1-1 pairing between CPU cores and GPUs will work on any system with any number of nodes. I've tested this on a system with 2 nodes and 4 GPUs per node (so "mpirun -n 8 -npernode 4" works).
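
To spell out the pairing, here is a minimal sketch (again, illustration only, not the code in my changes): the node-local rank is derived by comparing hostnames, which is just one common workaround, and it assumes you launch no more ranks per node than there are GPUs:

#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch: compute a node-local rank by comparing hostnames, then give
 * each MPI rank its own GPU.  Assumes ranks-per-node <= GPUs-per-node,
 * e.g. "mpirun -n 8 -npernode 4" on 2 nodes with 4 GPUs each. */
int bind_rank_to_gpu(MPI_Comm comm)
{
  int rank, size, len, i, local_rank = 0, ndev, dev;
  char name[MPI_MAX_PROCESSOR_NAME];
  char (*all)[MPI_MAX_PROCESSOR_NAME];

  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  memset(name, 0, sizeof(name));
  MPI_Get_processor_name(name, &len);

  all = malloc((size_t)size * sizeof(*all));
  MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, comm);

  /* Node-local rank = number of lower global ranks on the same host. */
  for (i = 0; i < rank; i++)
    if (!strcmp(all[i], name)) local_rank++;
  free(all);

  cudaGetDeviceCount(&ndev);
  dev = local_rank % ndev;   /* 1-1 pairing when npernode <= ndev */
  cudaSetDevice(dev);
  return dev;
}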

Thanks,
-Paul
> I assume that this only works on a single node in order to enumerate and initialize the GPUs correctly?

> Best regards,
> Karli
