On Sat, Dec 11, 2021 at 8:25 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
> I expected TACO to be better, since its website says "It uses novel compiler
> techniques to get performance competitive with hand-optimized kernels".
>

I would not. SpMV is just a bandwidth test. There is almost nothing else going on. As Barry says, if you run a stencil matrix, I don't think you could outperform the naive PETSc implementation by more than 15%. That is what the thousands of SpMV papers also show. If you use one of these small-world graph matrices, there can be larger gaps.

  Matt

> --Junchao Zhang
>
>
> On Sat, Dec 11, 2021 at 5:56 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>
>> Sorry, what's surprising about this? 40 MPI ranks on a single node should give performance similar to 40 threads. Both PETSc and TACO use a row-based parallelism strategy, so the numbers should line up.
>>
>> Rohan Yadav
>>
>> On Dec 11, 2021, at 6:44 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> On Sat, Dec 11, 2021 at 5:09 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>
>>> > Did you mean that with 1 rank or 40 MPI ranks, PETSc's performance is close to that of 1 thread or 40 threads of TACO?
>>>
>>> The 1-rank time is the same as TACO with 1 thread, and the 40-rank time is the same as TACO with 40 threads.
>>
>> Interesting. TACO is supposed to give an optimized SpMV.
>>
>>> Rohan
>>>
>>> On Sat, Dec 11, 2021 at 6:07 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>>> On Sat, Dec 11, 2021, 4:22 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>
>>>>> Thanks all for the help; the main problem was the lack of optimization flags in the default build provided by my system. A manual installation with optimization flags delivers performance equal to the single-node benchmark I discussed before.
>>>>
>>>> Did you mean that with 1 rank or 40 MPI ranks, PETSc's performance is close to that of 1 thread or 40 threads of TACO?
>>>>
>>>>> Rohan
>>>>>
>>>>> On Sat, Dec 11, 2021 at 4:04 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>>
>>>>>> > The Matrix Market file in text format is not good for loading. One should convert it to PETSc binary format (only once), and use the new binary file afterwards.
>>>>>>
>>>>>> Yes, I understand this. The point I'm trying to make is that even using PETSc to perform the initial conversion from Matrix Market to the binary format was prohibitively slow with `MatSetValues`.
>>>>>>
>>>>>> > I meant 10 lines of code without any function call, which can be thought of as a textbook implementation of SpMV. As a baseline, one can apply optimizations to it. PETSc does not do sophisticated sparse matrix optimization itself; instead, it relies on third-party libraries. I remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse, hipSparse, MKLSparse, or Kokkos-Kernels. If TACO is good, then PETSc can add an interface to it too.
>>>>>>
>>>>>> Yes, this is what I expected. Given that PETSc uses high-performance kernels for the sparse matrix operation itself, I was surprised that the single-thread performance of PETSc was not closer to that of a baseline like TACO. This performance will likely improve when I compile PETSc with optimization flags.
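>>>>>>
>>>>>> For concreteness, here is the kind of ~10-line textbook CSR kernel I understand you to mean, as a sketch in plain serial C over standard CSR arrays (the function and variable names here are mine, not PETSc's):
>>>>>>
>>>>>> /* Textbook CSR SpMV: y = A*x.
>>>>>>    rowptr has m+1 entries; colidx and vals have rowptr[m] entries. */
>>>>>> void spmv_csr(int m, const int *rowptr, const int *colidx,
>>>>>>               const double *vals, const double *x, double *y)
>>>>>> {
>>>>>>   for (int i = 0; i < m; i++) {
>>>>>>     double sum = 0.0;
>>>>>>     for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
>>>>>>       sum += vals[k] * x[colidx[k]];
>>>>>>     y[i] = sum;
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> Parallelizing the outer loop over rows needs no synchronization, since each iteration writes a distinct y[i]; that is exactly the "No Races" schedule in the TACO-generated code.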
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>> On Sat, Dec 11, 2021 at 1:04 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>
>>>>>>> On Sat, Dec 11, 2021 at 10:28 AM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>>>>
>>>>>>>> Hi Junchao,
>>>>>>>>
>>>>>>>> Thanks for the response!
>>>>>>>>
>>>>>>>> > You can use https://petsc.org/main/src/mat/tests/ex72.c.html to convert a Matrix Market file into a PETSc binary file. And then in your test, load the binary matrix, following this example: https://petsc.org/main/src/mat/tutorials/ex1.c.html
>>>>>>>>
>>>>>>>> I tried an example like this, but the performance was too slow (it would process only ~2000-3000 calls to `MatSetValues` per second), which is not reasonable for loading matrices with millions of non-zeros.
>>>>>>>
>>>>>>> The Matrix Market file in text format is not good for loading. One should convert it to PETSc binary format (only once), and use the new binary file afterwards.
>>>>>>>
>>>>>>>> > I don't know what "No Races" means, but it seems you'd better also verify the result of SpMV.
>>>>>>>>
>>>>>>>> This is a correct implementation of SpMV. The "No Races" setting is fine, as the generated code parallelizes over the rows of the matrix and thus needs no synchronization between writes to the output.
>>>>>>>>
>>>>>>>> > You can think of PETSc's default CSR SpMV as the baseline, which is done in ~10 lines of code.
>>>>>>>>
>>>>>>>> I'm sorry, but I don't think the line count alone makes it a good baseline. The TACO compiler can also be used in 10 lines of code to compute an SpMV, and any state-of-the-art library could wrap an SpMV implementation behind a single function call. I'm wondering whether the performance I'm seeing with PETSc is expected, or whether I've misconfigured or am misusing the system in some way.
>>>>>>>
>>>>>>> I meant 10 lines of code without any function call, which can be thought of as a textbook implementation of SpMV. As a baseline, one can apply optimizations to it. PETSc does not do sophisticated sparse matrix optimization itself; instead, it relies on third-party libraries. I remember we had OSKI from Berkeley for CPU, and on GPU we use cuSparse, hipSparse, MKLSparse, or Kokkos-Kernels. If TACO is good, then PETSc can add an interface to it too.
>>>>>>>
>>>>>>>> Rohan
>>>>>>>>
>>>>>>>> On Fri, Dec 10, 2021 at 11:39 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Dec 10, 2021 at 8:05 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I'm Rohan, a student working on compilation techniques for distributed tensor computations. I'm looking at using PETSc as a baseline for experiments I'm running, and I want to understand whether I'm using PETSc as intended to achieve high performance, and whether the performance I'm seeing is expected. Currently, I'm just looking at SpMV operations.
>>>>>>>>>>
>>>>>>>>>> My experiments are run on the Lassen supercomputer (https://hpc.llnl.gov/hardware/platforms/lassen). Each node has 40 CPUs, 4 V100s, and an InfiniBand interconnect. A visualization of the architecture is here: https://hpc.llnl.gov/sites/default/files/power9-AC922systemDiagram2_1.png
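>>>>>>>>>>
>>>>>>>>>> At its core, each benchmark run loads the matrix from PETSc binary format and times repeated calls to MatMult. The sketch below is a simplified version of that loop, not my exact code (file name assumed, warmup iterations and error handling trimmed); my full benchmark is linked below.
>>>>>>>>>>
>>>>>>>>>> #include <petscmat.h>
>>>>>>>>>> #include <petsctime.h>
>>>>>>>>>>
>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>> {
>>>>>>>>>>   Mat            A;
>>>>>>>>>>   Vec            x, y;
>>>>>>>>>>   PetscViewer    viewer;
>>>>>>>>>>   PetscLogDouble t0, t1;
>>>>>>>>>>   PetscInt       i, niter = 20;
>>>>>>>>>>
>>>>>>>>>>   PetscInitialize(&argc, &argv, NULL, NULL);
>>>>>>>>>>   /* MatLoad reads a preassembled AIJ matrix from the binary file,
>>>>>>>>>>      so no per-entry MatSetValues calls are needed. */
>>>>>>>>>>   PetscViewerBinaryOpen(PETSC_COMM_WORLD, "arabic-2005.petsc", FILE_MODE_READ, &viewer);
>>>>>>>>>>   MatCreate(PETSC_COMM_WORLD, &A);
>>>>>>>>>>   MatLoad(A, viewer);
>>>>>>>>>>   PetscViewerDestroy(&viewer);
>>>>>>>>>>   MatCreateVecs(A, &x, &y);
>>>>>>>>>>   VecSet(x, 1.0);
>>>>>>>>>>
>>>>>>>>>>   /* Time niter SpMVs and report the average per call. */
>>>>>>>>>>   PetscTime(&t0);
>>>>>>>>>>   for (i = 0; i < niter; i++) MatMult(A, x, y);
>>>>>>>>>>   PetscTime(&t1);
>>>>>>>>>>   PetscPrintf(PETSC_COMM_WORLD, "avg MatMult: %g ms\n", 1000.0 * (t1 - t0) / niter);
>>>>>>>>>>
>>>>>>>>>>   MatDestroy(&A);
>>>>>>>>>>   VecDestroy(&x);
>>>>>>>>>>   VecDestroy(&y);
>>>>>>>>>>   return PetscFinalize();
>>>>>>>>>> }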
>>>>>>>>>>
>>>>>>>>>> As of now, I'm trying to understand the single-node performance of PETSc, as the scaling performance onto multiple nodes appears to be as I expect. I'm using the arabic-2005 sparse matrix from the SuiteSparse matrix collection, detailed here: https://sparse.tamu.edu/LAW/arabic-2005. As a trusted baseline, I am comparing against SpMV code generated by the TACO compiler (http://tensor-compiler.org/codegen.html?expr=y(i)%20=%20A(i,j)%20*%20x(j)&format=y:d:0;A:ds:0,1;x:d:0&sched=split:i:i0:i1:32;reorder:i0:i1:j;parallelize:i0:CPU%20Thread:No%20Races).
>>>>>>>>>
>>>>>>>>> I don't know what "No Races" means, but it seems you'd better also verify the result of SpMV.
>>>>>>>>>
>>>>>>>>>> My experiments find that PETSc is roughly 4 times slower than the kernel generated by TACO, both with a single thread and with a full node:
>>>>>>>>>>
>>>>>>>>>> PETSc: 1 thread: 5694.72 ms; 1 node, 40 ranks: 262.6 ms.
>>>>>>>>>>
>>>>>>>>>> TACO: 1 thread: 1341 ms; 1 node, 40 threads: 86 ms.
>>>>>>>>>
>>>>>>>>> You can think of PETSc's default CSR SpMV as the baseline, which is done in ~10 lines of code.
>>>>>>>>>
>>>>>>>>>> My code using PETSc is here: https://github.com/rohany/taco/blob/9e0e30b16bfba5319b15b2d1392f35376952f838/petsc/benchmark.cpp#L38
>>>>>>>>>>
>>>>>>>>>> Runs from 1 thread and 1 node with -log_view are attached to the email. The command lines for each were as follows:
>>>>>>>>>>
>>>>>>>>>> 1 node, 1 thread: `jsrun -n 1 -c 1 -r 1 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>>>
>>>>>>>>>> 1 node, 40 ranks: `jsrun -n 40 -c 1 -r 40 -b rs ./bin/benchmark -n 20 -warmup 10 -matrix $TENSOR_DIR/arabic-2005.petsc -log_view`
>>>>>>>>>>
>>>>>>>>>> In addition to these benchmarking concerns, I wanted to share my experience trying to load data from Matrix Market files into PETSc, which ended up being much more difficult than I anticipated. Essentially, iterating through the Matrix Market file and using `MatSetValues` to insert entries into a `Mat` was extremely slow. To get reasonable performance, I had to use an external utility to construct a CSR matrix, and then pass the arrays of the CSR matrix to `MatCreateSeqAIJWithArrays`. I couldn't find any guidance on the PETSc forums or Google, so I wanted to know whether this was the right way to go.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Rohan Yadav

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/