The flop rate of 123 megaflops is absurdly small; something unexpected must be happening. Any chance you can send me a small code that reproduces the result below?
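For reference, a minimal standalone reproducer of the kind being asked for might look like the sketch below. It is only an illustration: the file name, matrix size, bandwidth, and the helper make_banded are made up here and are not taken from the code discussed in this thread; it is written as a serial (COMM_SELF) baseline, matching the earlier request to try one process first.

    # axpy_baseline.py -- serial baseline; run as: python axpy_baseline.py -log_view
    import sys
    import time
    import petsc4py
    petsc4py.init(sys.argv)              # forward -log_view etc. to PETSc
    from petsc4py import PETSc

    n, band = 5000, 200                  # illustrative size and nonzeros per row

    def make_banded(scale):
        """Assemble an n x n AIJ matrix with `band` nonzeros per row."""
        A = PETSc.Mat().createAIJ([n, n], nnz=band, comm=PETSc.COMM_SELF)
        for i in range(n):
            cols = [(i + j) % n for j in range(band)]
            vals = [scale * (j + 1.0) for j in range(band)]
            A.setValues(i, cols, vals)   # one call per row
        A.assemble()
        return A

    A = make_banded(1.0)
    B = make_banded(2.0)                 # same nonzero pattern as A by construction

    stage = PETSc.Log.Stage("MatAXPY only")
    stage.push()
    t0 = time.perf_counter()
    A.axpy(0.5, B, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)
    stage.pop()
    PETSc.Sys.Print("MatAXPY time: %.3e s" % (time.perf_counter() - t0))

Running with -log_view then isolates the MatAXPY event in its own stage, so its flop rate can be compared directly against the numbers quoted below.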
> On Nov 11, 2023, at 9:19 PM, Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>
> I've isolated the profiling to only run during the addition of the two matrices. The output is
>
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatAssemblyBegin       1 1.0 4.5000e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 7.4370e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> MatAXPY                1 1.0 6.6962e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0 100 100  0  0  0   123
> ------------------------------------------------------------------------------------------------------------------------
>
> On Sat, Nov 11, 2023 at 6:20 PM Donald Rex Planalp <donald.plan...@colorado.edu <mailto:donald.plan...@colorado.edu>> wrote:
>> My apologies, I missed your other question.
>>
>> Yes, I imagine most of those setValue and getRow calls are due to the construction of the matrices in question. I ended up writing my own "semi-parallel" Kronecker product function, which involves setting and getting a lot of values. I imagine I should optimize that much more, but I consider it an upfront cost to time propagation, so I haven't touched it since I got it working.
>>
>> Thank you for the quick replies.
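On the Kronecker-product construction mentioned just above: the cost of many small MatSetValues/MatGetRow calls usually comes down to preallocation and to inserting whole rows per call rather than single entries. A rough serial sketch of that pattern follows; the function name kron_aij and the use of COMM_SELF are illustrative only (this is not the "semi-parallel" routine from the thread), and a parallel version would additionally need per-rank d_nnz/o_nnz preallocation and ownership-aware row loops.

    import numpy as np
    from petsc4py import PETSc

    def kron_aij(A, B):
        """C = A (x) B for two assembled sequential AIJ matrices; illustrative only."""
        m, n = A.getSize()
        p, q = B.getSize()
        nnzA = [len(A.getRow(i)[0]) for i in range(m)]
        nnzB = [len(B.getRow(k)[0]) for k in range(p)]
        # exact per-row preallocation: nnz(C row (i,k)) = nnz(A row i) * nnz(B row k)
        nnzC = [nnzA[i] * nnzB[k] for i in range(m) for k in range(p)]
        C = PETSc.Mat().createAIJ([m * p, n * q], nnz=nnzC, comm=PETSc.COMM_SELF)
        for i in range(m):
            colsA, valsA = A.getRow(i)
            for k in range(p):
                colsB, valsB = B.getRow(k)
                cols = np.add.outer(colsA * q, colsB).ravel()   # C column = jA*q + jB
                vals = np.outer(valsA, valsB).ravel()           # C value  = A[i,jA]*B[k,jB]
                C.setValues(i * p + k, cols, vals)              # one call per row of C
        C.assemble()
        return C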
>> On Sat, Nov 11, 2023 at 6:12 PM Donald Rex Planalp <donald.plan...@colorado.edu <mailto:donald.plan...@colorado.edu>> wrote:
>>> My apologies, that was run with 7 cores; this is the result for n = 1.
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided         13 1.0 1.4770e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BuildTwoSidedF        13 1.0 2.3150e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecView               24 1.0 3.1525e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecCopy               12 1.0 4.1000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyBegin      13 1.0 5.1290e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd        13 1.0 7.0200e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSetRandom           2 1.0 6.2500e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult              112 1.0 1.7071e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8268
>>> MatSolve             112 1.0 8.3160e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  8486
>>> MatLUFactorSym         2 1.0 1.2660e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatLUFactorNum         2 1.0 2.2895e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4646
>>> MatScale               2 1.0 4.4903e-02 1.0 1.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 61  0  0  0   0 61  0  0  0  3671
>>> MatAssemblyBegin     169 1.0 4.5040e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd       169 1.0 2.2126e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 40  0  0  0  0  40  0  0  0  0     0
>>> MatSetValues        9078 1.0 6.1440e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0   1 31  0  0  0   134
>>> MatGetRow           9078 1.0 5.8936e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetRowIJ            2 1.0 9.0900e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetOrdering         2 1.0 3.6880e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAXPY                1 1.0 6.4112e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0   1 31  0  0  0   129
>>> PCSetUp                2 1.0 4.6308e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2297
>>> PCApply              112 1.0 8.4632e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  8339
>>> KSPSetUp               2 1.0 4.5000e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve             112 1.0 8.9241e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7908
>>> EPSSetUp               2 1.0 6.5904e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1614
>>> EPSSolve               2 1.0 4.4982e-03 1.0 2.11e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  4687
>>> STSetUp                2 1.0 5.0441e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2109
>>> STComputeOperatr       2 1.0 2.5700e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> STApply              112 1.0 1.6852e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8375
>>> STMatSolve           112 1.0 9.1930e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7677
>>> BVCopy                20 1.0 1.7930e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BVMultVec            219 1.0 2.3467e-04 1.0 2.31e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  9861
>>> BVMultInPlace         12 1.0 1.1088e-04 1.0 1.09e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  9857
>>> BVDotVec             219 1.0 2.9956e-04 1.0 2.47e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  8245
>>> BVOrthogonalizeV     114 1.0 5.8952e-04 1.0 4.82e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  8181
>>> BVScale              114 1.0 2.9150e-05 1.0 4.06e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1392
>>> BVSetRandom            2 1.0 9.6900e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BVMatMultVec         112 1.0 1.7464e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8082
>>> DSSolve               10 1.0 9.1354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> DSVectors             24 1.0 1.1908e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> DSOther               30 1.0 1.2798e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> On Sat, Nov 11, 2023 at 5:59 PM Barry Smith <bsm...@petsc.dev <mailto:bsm...@petsc.dev>> wrote:
>>>>
>>>>    How many MPI processes did you use? Please try with just one to get a baseline.
>>>>
>>>> MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
>>>> MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
>>>>
>>>>    This is concerning: there are 1298 MatSetValues and MatGetRow calls. Are you calling them? If not, that means the MatAXPY is calling them (if SAME_NONZERO_PATTERN is not used, these are used inside the MatAXPY).
>>>>
>>>>> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores?
>>>>
>>>>    Yes, it is usually the right thing to do for numerical computing. I included a link to some discussion of it on petsc.org <http://petsc.org/>.
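To make the effect of the structure flag concrete, here is a tiny self-contained petsc4py sketch (4x4 diagonal matrices, purely illustrative and unrelated to the matrices in this thread) showing where the two code paths diverge:

    from petsc4py import PETSc

    Structure = PETSc.Mat.Structure
    n = 4
    S = PETSc.Mat().createAIJ([n, n], nnz=1, comm=PETSc.COMM_SELF)
    H = PETSc.Mat().createAIJ([n, n], nnz=1, comm=PETSc.COMM_SELF)
    for i in range(n):
        S.setValue(i, i, 1.0)
        H.setValue(i, i, 2.0)
    S.assemble()
    H.assemble()

    # DIFFERENT_NONZERO_PATTERN: MatAXPY has to work out the union pattern,
    # which is where the MatGetRow/MatSetValues counts in the log come from.
    S.axpy(-0.5, H, structure=Structure.DIFFERENT_NONZERO_PATTERN)

    # SAME_NONZERO_PATTERN (valid here because S and H share a pattern):
    # the addition reduces to an axpy over the stored value arrays.
    S.axpy(-0.5, H, structure=Structure.SAME_NONZERO_PATTERN)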
>>>>> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <donald.plan...@colorado.edu <mailto:donald.plan...@colorado.edu>> wrote:
>>>>>
>>>>> Hello again,
>>>>>
>>>>> I've run the simulation with profiling again. In this setup I only ran the necessary methods to construct the matrices, and then at the end I added H_mix and H_ang using the MatAXPY method. Below are the results:
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>> BuildTwoSided        350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02  3  0 24  0 22   3  0 24  0 22     0
>>>>> BuildTwoSidedF       236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02  3  0 14  0 15   3  0 14  0 15     0
>>>>> SFSetGraph           114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFSetUp              116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02  0  0 38 71  7   0  0 38 71  7     0
>>>>> SFPack               312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFUnpack             312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecView               26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecCopy               13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecAssemblyBegin      14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01  0  0 14  0  1   0  0 14  0  1     0
>>>>> VecAssemblyEnd        14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecScatterBegin      312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02  0  0 45 29  6   0  0 45 29  7     0
>>>>> VecScatterEnd        312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecSetRandom           2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatMult              104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> MatSolve             104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
>>>>> MatLUFactorSym         2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatLUFactorNum         2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatScale               2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 65  0  0  0   0 65  0  0  0  3657
>>>>> MatAssemblyBegin     169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02  2  0  0  0  7   2  0  0  0  7     0
>>>>> MatAssemblyEnd       169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22  0 36 71 48  22  0 36 71 49     0
>>>>> MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
>>>>> MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
>>>>> PCSetUp                2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
>>>>> PCApply              104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
>>>>> KSPSetUp               2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> KSPSolve             104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
>>>>> EPSSetUp               2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>> EPSSolve               2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02  0  3 49 29 28   0  3 49 29 28   242
>>>>> STSetUp                2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
>>>>> STComputeOperatr       2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> STApply              104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> STMatSolve           104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
>>>>> BVCopy                20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> BVMultVec            206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 11442
>>>>> BVMultInPlace         11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 11026
>>>>> BVDotVec             206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  1  0  0 13   0  1  0  0 13  2205
>>>>> BVOrthogonalizeV     106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  2  0  0 13   0  2  0  0 13  3706
>>>>> BVScale              106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   418
>>>>> BVSetRandom            2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> BVMatMultVec         104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> DSSolve                9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> DSVectors             24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> DSOther               27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> I believe the only matrix addition I did is shown there, taking about 0.17 seconds. Doing so twice per time step, in addition to the scaling, ends up taking about 0.4 seconds during the full calculation, so the time step is far too slow.
>>>>>
>>>>> However, it seems that the number of flops is on the order of 10^7. Would this imply that I am perhaps memory-bandwidth bound on my home PC?
>>>>>
>>>>> I'm still somewhat new to parallel computing, so I'm not sure what you mean by binding. Does this lock in certain processes to certain cores?
>>>>>
>>>>> Thank you for your time and assistance
>>>>>
>>>>> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <bsm...@petsc.dev <mailto:bsm...@petsc.dev>> wrote:
>>>>>>
>>>>>>    Here is the code for MPIBAIJ with SAME_NONZERO_PATTERN (the other formats are similar):
>>>>>>
>>>>>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>>>>>> {
>>>>>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>>>>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>>>>>   PetscBLASInt one = 1;
>>>>>>   ....
>>>>>>   if (str == SAME_NONZERO_PATTERN) {
>>>>>>     PetscScalar  alpha = a;
>>>>>>     PetscBLASInt bnz;
>>>>>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>>>>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>>>>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>>>>>>
>>>>>>    It directly adds the nonzero values from the two matrices together (the nonzero structure plays no role), and it uses the BLAS, so it should perform as "fast as possible" given the hardware. (Are you configured with --with-debugging=0? Are you using binding with mpiexec to ensure MPI is using the best combination of cores? https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup)
>>>>>>
>>>>>>    Can you run with -log_view and send the output so we can see directly the performance?
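On the binding question raised above: the exact spelling depends on the MPI implementation, but a typical invocation looks like one of the lines below (the script name is a placeholder):

    # Open MPI:
    mpiexec -n 6 --bind-to core python simulation.py -log_view
    # MPICH (hydra):
    mpiexec -n 6 -bind-to core python simulation.py -log_view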
>>>>>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <donald.plan...@colorado.edu <mailto:donald.plan...@colorado.edu>> wrote:
>>>>>>>
>>>>>>> Hello, and thank you for the quick reply.
>>>>>>>
>>>>>>> For some context, all of these matrices are produced from a Kronecker product in parallel between an "angular" and a "radial" matrix. In this case S_total and H_atom use the same angular matrix, which is the identity, while their radial matrices obey different sparsity.
>>>>>>>
>>>>>>> Meanwhile, H_mix and H_ang are constructed from two different angular matrices; however, both are only nonzero along the two off-diagonals, and the radial matrices have the same sparse structure.
>>>>>>>
>>>>>>> So S_total and H_atom have the same block sparse structure, while H_mix and H_ang have the same block sparse structure. However, even just adding H_mix and H_ang for the simplest case takes around half a second, which is still too slow, even when using SAME_NONZERO_PATTERN.
>>>>>>>
>>>>>>> This confuses me the most because another researcher wrote a similar code a few years back using finite differences, so his matrices are on the order of 250k x 250k and larger, yet the matrix additions are far faster for him.
>>>>>>>
>>>>>>> If you would like I can show more code snippets, but I'm not sure what you would want to see.
>>>>>>>
>>>>>>> Thank you for your time
>>>>>>>
>>>>>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <bsm...@petsc.dev <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>>
>>>>>>>>    DIFFERENT_NONZERO_PATTERN will always be slow because it needs to determine the nonzero pattern of the result for each computation.
>>>>>>>>
>>>>>>>>    SAME_NONZERO_PATTERN can obviously run fast but may not be practical for your problem.
>>>>>>>>
>>>>>>>>    SUBSET_NONZERO_PATTERN (which means in A += b*B we know that B has NO nonzero locations that A does not have) does not need to reallocate anything in A, so it can be reasonably fast (we can do more optimizations on the current code to improve it for you).
>>>>>>>>
>>>>>>>>    So, can you construct the S nonzero structure initially to have the union of the nonzero structures of all the matrices that accumulate into it? This is the same nonzero structure that it "ends up with" in the current code. With this nonzero structure, SUBSET_NONZERO_PATTERN can be used. Depending on how sparse all the matrices that get accumulated into S are, you could perhaps go further and make sure they all have the same nonzero structure (sure, it uses more memory and will do extra operations, but the actual computation will be so fast it may be worth the extra computations).
>>>>>>>>
>>>>>>>>    Barry
>>>>>>>>
>>>>>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <donald.plan...@colorado.edu <mailto:donald.plan...@colorado.edu>> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am trying to use petsc4py to conduct a quantum mechanics simulation. I've been able to construct all of the relevant matrices; however, I am reaching a gigantic bottleneck.
>>>>>>>>>
>>>>>>>>> For the simplest problem I am running, I have a few matrices, each about 5000x5000. In order to begin time propagation I need to add these matrices together. However, on 6 cores of my local machine it is taking approximately 1-2 seconds per addition. Since I need to do this for each time step in my simulation, it is prohibitively slow: there could be upwards of 10K time steps.
>>>>>>>>>
>>>>>>>>> Below is the relevant code:
>>>>>>>>>
>>>>>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>>>>>>
>>>>>>>>> if test2:
>>>>>>>>>     def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>>>>>         S.axpy(-Field.pulse[i], MIX, structure)
>>>>>>>>>         S.axpy(-Field.pulse[i], ANG, structure)
>>>>>>>>>         S.axpy(-1, ATOM, structure)
>>>>>>>>>         return S
>>>>>>>>>
>>>>>>>>>     def makeRight(S, MIX, ANG, ATOM, i):
>>>>>>>>>         S.axpy(Field.pulse[i], MIX, structure)
>>>>>>>>>         S.axpy(Field.pulse[i], ANG, structure)
>>>>>>>>>         S.axpy(1, ATOM, structure)
>>>>>>>>>         return S
>>>>>>>>>
>>>>>>>>>     H_mix = Int.H_mix
>>>>>>>>>     H_mix.scale(1j * dt / 2)
>>>>>>>>>
>>>>>>>>>     H_ang = Int.H_ang
>>>>>>>>>     H_ang.scale(1j * dt / 2)
>>>>>>>>>
>>>>>>>>>     H_atom = Int.H_atom
>>>>>>>>>     H_atom.scale(1j * dt / 2)
>>>>>>>>>
>>>>>>>>>     S = Int.S_total
>>>>>>>>>
>>>>>>>>>     psi_initial = psi.psi_initial.copy()
>>>>>>>>>     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>>>>>>
>>>>>>>>>     for i, t in enumerate(box.t):
>>>>>>>>>         print(i, L)
>>>>>>>>>
>>>>>>>>>         O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>>>>>         O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>>>>>>
>>>>>>>>>         if i == 0:
>>>>>>>>>             known = O_R.getVecRight()
>>>>>>>>>             sol = O_L.getVecRight()
>>>>>>>>>
>>>>>>>>>         O_R.mult(psi_initial, known)
>>>>>>>>>
>>>>>>>>>         ksp.setOperators(O_L)
>>>>>>>>>         ksp.solve(known, sol)
>>>>>>>>>
>>>>>>>>>         psi_initial.copy(sol)
>>>>>>>>>
>>>>>>>>> I need to clean it up a bit, but the main point is that I need to add the matrices many times for a single time step. I can't preallocate memory very well since some of the matrices aren't the most sparse either. It seems that if I cannot speed up the addition it will be difficult to continue, so I was wondering if you had any insights.
>>>>>>>>>
>>>>>>>>> Thank you for your time
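Tying the SUBSET_NONZERO_PATTERN suggestion above back to this code: one way to give S the union nonzero structure up front from petsc4py is to insert explicit zeros into S, once before the time loop, at every location used by the H matrices, and then do the per-step additions with SUBSET_NONZERO_PATTERN on a duplicate of S. A rough sketch under those assumptions follows; the helper names merge_pattern_into and make_left are invented here, and duplicating S each step is just one way of keeping S itself unchanged rather than the method used in the thread.

    import numpy as np
    from petsc4py import PETSc

    Structure = PETSc.Mat.Structure

    def merge_pattern_into(S, H):
        """One-time setup: add explicit zeros to S wherever H has stored entries,
        so that H's nonzero pattern becomes a subset of S's."""
        S.setOption(PETSc.Mat.Option.NEW_NONZERO_ALLOCATION_ERR, False)  # allow one-time mallocs
        rstart, rend = H.getOwnershipRange()
        for row in range(rstart, rend):
            cols, _ = H.getRow(row)
            if len(cols) > 0:
                zeros = np.zeros(len(cols), dtype=PETSc.ScalarType)
                S.setValues(row, cols, zeros, addv=PETSc.InsertMode.ADD_VALUES)
        S.assemble()

    def make_left(S, H_mix, H_ang, H_atom, pulse_i):
        """Per-step operator; assumes merge_pattern_into(S, H) was called for each H."""
        O_L = S.duplicate(copy=True)
        O_L.axpy(-pulse_i, H_mix,  structure=Structure.SUBSET_NONZERO_PATTERN)
        O_L.axpy(-pulse_i, H_ang,  structure=Structure.SUBSET_NONZERO_PATTERN)
        O_L.axpy(-1.0,     H_atom, structure=Structure.SUBSET_NONZERO_PATTERN)
        return O_L

The pattern merge pays the slow path exactly once at setup; after that every per-step axpy can take the cheap SUBSET/SAME code path instead of rebuilding the nonzero structure each time.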