On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
> Hi Junchao,
>
> Thanks for the response!
>
> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to
> > convert a Matrix Market file into a petsc binary file. And then in
> > your test, load the binary matrix, following this example
> > https://petsc.org/main/src/mat/tutorials/ex1.c.html
>
> I tried an example like this, but the performance was too slow (it
> would process only ~2000-3000 calls to `SetValue` per second), which is
> not reasonable for loading matrices with millions of non-zeros.

The Matrix Market file is plain text and is not good for loading. One
should convert it to petsc binary format (only once) and use the new
binary file afterwards; see the loading sketch below.

> > I don't know what "No Races" means, but it seems you'd better also
> > verify the result of SpMV.
>
> This is a correct implementation of SpMV. The "No Races" is fine: the
> kernel parallelizes over the rows of the matrix, and thus does not need
> synchronization between writes to the output.
>
> > You can think of petsc's default CSR spmv as the baseline, which is
> > done in ~10 lines of code.
>
> I'm sorry, but I don't think that is a reasonable statement w.r.t. the
> lines of code making it a good baseline. The TACO compiler can also be
> used in 10 lines of code to compute an SpMV, and any other
> state-of-the-art library could wrap an SpMV implementation behind a
> single function call. I'm wondering if the performance I'm seeing with
> PETSc is expected, or if I've misconfigured or am misusing the system
> in some way.

I meant 10 lines of code without any function call, which can be thought
of as a textbook implementation of SpMV; as a baseline, one can apply
optimizations to it. PETSc does not do sophisticated sparse matrix
optimization itself; instead it relies on third-party libraries. I
remember we had OSKI from Berkeley for the CPU, and on GPUs we use
cuSparse, hipSparse, MKLSparse or Kokkos-Kernels. If TACO is good, then
petsc can add an interface to it too.
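Concretely, the textbook kernel is just a row-parallel loop over a CSR
structure, as in the minimal sketch below (plain C with OpenMP; the
array names rowptr, colind, and val are illustrative, not PETSc
internals). Each iteration writes only y[i], so the rows need no
synchronization, which matches the no-races argument above:

    /* y = A*x for an m-row CSR matrix: rowptr has m+1 entries,
       colind and val have rowptr[m] entries each. */
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
      double sum = 0.0;
      for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
        sum += val[k] * x[colind[k]];
      y[i] = sum; /* each thread writes only its own rows */
    }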
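As for the loading question: once the file has been converted to petsc
binary format, the whole load is a handful of calls, roughly as in the
ex1.c tutorial linked above (a minimal sketch; the filename is
illustrative):

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat            A;
      PetscViewer    viewer;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      /* Open the binary file produced by the ex72.c conversion */
      ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "arabic-2005.petsc",
                                   FILE_MODE_READ, &viewer); CHKERRQ(ierr);
      ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
      ierr = MatSetFromOptions(A); CHKERRQ(ierr);
      ierr = MatLoad(A, viewer); CHKERRQ(ierr); /* bulk read; no per-entry inserts */
      ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
      /* ... benchmark MatMult(A, x, y) here ... */
      ierr = MatDestroy(&A); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

Since MatLoad reads the file in large chunks, it avoids the
one-entry-at-a-time insertion path entirely.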
> Rohan
>
> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zh...@gmail.com>
> wrote:
>
>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <roh...@alumni.cmu.edu>
>> wrote:
>>
>>> Hi, I'm Rohan, a student working on compilation techniques for
>>> distributed tensor computations. I'm looking at using PETSc as a
>>> baseline for experiments I'm running, and I want to understand whether
>>> I'm using PETSc as it was intended to achieve high performance, and
>>> whether the performance I'm seeing is expected. Currently, I'm just
>>> looking at SpMV operations.
>>>
>>> My experiments are run on the Lassen supercomputer
>>> (https://hpc.llnl.gov/hardware/platforms/lassen). Each node has 40
>>> CPUs, 4 V100s, and an Infiniband interconnect. A visualization of the
>>> architecture is here:
>>> https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
>>>
>>> As of now, I'm trying to understand the single-node performance of
>>> PETSc, as the scaling performance onto multiple nodes appears to be as
>>> I expect. I'm using the arabic-2005 sparse matrix from the SuiteSparse
>>> matrix collection, detailed here:
>>> https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am
>>> comparing against SpMV code generated by the TACO compiler
>>> (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>>
>> I don't know what "No Races" means, but it seems you'd better also
>> verify the result of SpMV.
>>
>>> My experiments find that PETSc is roughly 4 times slower than the
>>> kernel generated by TACO, on both a single thread and a single node:
>>>
>>> PETSc: 1 thread: 5694.72 ms, 1 node (40 threads): 262.6 ms.
>>> TACO:  1 thread: 1341 ms,    1 node (40 threads): 86 ms.
>>
>> You can think of petsc's default CSR spmv as the baseline, which is
>> done in ~10 lines of code.
>>
>>> My code using PETSc is here:
>>> https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>>
>>> Runs with -log_view from 1 thread and from 1 node are attached to the
>>> email. The command lines for each were as follows:
>>>
>>> 1 node, 1 thread:
>>> `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10
>>> -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>
>>> 1 node, 40 threads:
>>> `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10
>>> -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>
>>> In addition to these benchmarking concerns, I wanted to share my
>>> experience trying to load data from Matrix Market files into PETSc,
>>> which ended up being much more difficult than I anticipated.
>>> Essentially, iterating through the Matrix Market file and inserting
>>> entries into a `Mat` one at a time was extremely slow. In order to get
>>> reasonable performance, I had to use an external utility to construct
>>> a CSR matrix first, and then pass the arrays of the CSR matrix into
>>> `MatCreateSeqAIJWithArrays`. I couldn't find any guidance on this in
>>> the PETSc forums or on Google, so I wanted to know if this was the
>>> right way to go.
>>>
>>> Thanks,
>>>
>>> Rohan Yadav
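For reference, the CSR hand-off described in the thread above comes down
to a single call. A minimal sketch follows (the helper and variable
names are illustrative; the arrays must form a 0-based CSR structure):

    #include <petscmat.h>

    /* Wrap an existing 0-based CSR structure (e.g. built by an external
       utility) in a PETSc Mat without copying it. rowptr has m+1
       entries; colind and vals have rowptr[m] entries each. */
    static PetscErrorCode WrapCSR(PetscInt m, PetscInt n, PetscInt *rowptr,
                                  PetscInt *colind, PetscScalar *vals, Mat *A)
    {
      PetscErrorCode ierr;
      ierr = MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, m, n,
                                       rowptr, colind, vals, A); CHKERRQ(ierr);
      /* PETSc neither copies nor frees these arrays; they must remain
         valid until MatDestroy(A) is called. */
      return 0;
    }

Because the matrix is created already assembled, this skips both the
per-entry insertion and the assembly pass, which is why it is so much
faster than a loop of `MatSetValues` calls without preallocation.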