On Sat, Sep 25, 2021 at 4:45 PM Mark Adams <mfad...@lbl.gov> wrote:

> I am testing my Landau code, which is MPI serial, but with many
> independent MPI processes driving each GPU, in an MPI parallel harness
> code (Landau ex2).
>
> Vector operations with Kokkos Kernels and cuSparse are about the same (KK
> is faster) and a bit expensive with one process / GPU. About the same as
> my Jacobian construction, which is expensive but optimized on the GPU. (I
> am using an arkimex adaptive TS. I am guessing that it does a lot of
> vector ops, because there are a lot.)
>
> With 14 or 15 processes, all doing the same MPI serial problem, cuSparse
> is about 2.5x more expensive than KK. KK degrades by about 15% from the
> one-process case. So KK is doing fine, but something bad is happening
> with cuSparse.

Do AIJKOKKOS and AIJCUSPARSE use different algorithms? I don't know. To know for certain, the best approach is to ask Peng@nvidia to profile the code.
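An untested sketch of how I would set up that comparison and the profile (I am assuming ex2's Mat/Vec come from a DM, so the -dm_-prefixed options apply; if not, the plain -mat_type/-vec_type options do the same thing):

    # Kokkos Kernels backend
    mpiexec -n 14 ./ex2 -dm_mat_type aijkokkos -dm_vec_type kokkos -log_view

    # cuSparse backend
    mpiexec -n 14 ./ex2 -dm_mat_type aijcusparse -dm_vec_type cuda -log_view

    # Per-kernel timeline with Nsight Systems, one report per rank
    # (%q{OMPI_COMM_WORLD_RANK} assumes Open MPI; use the analogous
    # env var for other MPI implementations)
    mpiexec -n 14 nsys profile -o landau_ex2.%q{OMPI_COMM_WORLD_RANK} \
        ./ex2 -dm_mat_type aijcusparse -dm_vec_type cuda

-log_view already breaks the time down by event (MatMult, VecAXPY, etc.), so comparing the two runs should at least show whether the 2.5x is in the SpMV itself or in the vector ops.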
> Anyone have any thoughts on this?
>
> Thanks,
> Mark