On Sat, Sep 25, 2021 at 8:12 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> On Sat, Sep 25, 2021 at 4:45 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> I am testing my Landau code, which is MPI serial but with many
>> independent MPI processes driving each GPU, in an MPI-parallel harness
>> code (Landau ex2).
>>
>> Vector operations with Kokkos Kernels and cuSparse are about the same
>> (KK is faster) and a bit expensive with one process per GPU. They cost
>> about the same as my Jacobian construction, which is expensive but
>> optimized on the GPU. (I am using the adaptive arkimex TS, and I am
>> guessing it is what does all these vector ops, because there are a lot
>> of them.)
>>
>> With 14 or 15 processes, all doing the same MPI-serial problem, cuSparse
>> is about 2.5x more expensive than KK. KK degrades by about 15% from the
>> one-process case. So KK is doing fine, but something bad is happening
>> with cuSparse.
>>
> AIJKOKKOS and AIJCUSPARSE have different algorithms? I don't know. To
> know exactly, the best approach is to consult with Peng@nvidia to profile
> the code.
>

Yeah, I could ask Peng if he has any thoughts.

I am also now having a problem with the snes/tests/ex13 scaling study (for
my ECP report). The cuSparse version of GAMG is hanging on an 8-node job
with a refinement of 3. It works on one node with a refinement of 4 and on
8 nodes with a refinement of 2.

I recently moved from CUDA-10 to CUDA-11 on Summit because MPS seems to be
working with CUDA-11, whereas it was not a while ago. I think I will try
going back to CUDA-10 and see if anything changes.

Thanks,
Mark

>
>> Anyone have any thoughts on this?
>>
>> Thanks,
>> Mark
>>
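
P.S. For anyone wanting to reproduce the KK vs. cuSparse comparison: the
only thing being switched is the PETSc Mat/Vec backend, which is chosen at
runtime (the ex2/ex13 runs use DM-prefixed options along the lines of
-dm_mat_type aijcusparse or aijkokkos; the exact prefixes depend on the
example). Below is a minimal stand-alone sketch of the same idea, assuming
a PETSc build with both CUDA and Kokkos enabled; it is illustrative only
and not the actual test code. Run it with -mat_type aijcusparse for the
cuSparse path or -mat_type aijkokkos for the Kokkos Kernels path.

/* Minimal sketch: the GPU backend is picked by -mat_type at runtime;
 * the vectors created from the matrix inherit the matching GPU type. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscInt       i, n = 1000;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);  /* -mat_type aijcusparse | aijkokkos */
  ierr = MatSetUp(A);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {                   /* trivial diagonal matrix */
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr); /* vectors match the matrix backend */
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = MatMult(A, x, y);CHKERRQ(ierr);         /* runs via cuSparse or Kokkos Kernels */

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}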