Is there an option to turn off MAT_SUBSET_OFF_PROC_ENTRIES for Mohammad to try?
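For context, MAT_SUBSET_OFF_PROC_ENTRIES is a per-matrix option set through MatSetOption(), so it can be switched off for a quick A/B test. A minimal sketch, assuming a working PETSc build; the matrix A and the helper name are illustrative, not from Mohammad's code (untested fragment, since it needs a PETSc installation to compile):

```c
#include <petscmat.h>

/* Disable MAT_SUBSET_OFF_PROC_ENTRIES on matrix A for comparison.
   With PETSC_TRUE, the user promises that assemblies after the first
   communicate only a subset of the first assembly's off-process
   entries, letting PETSc reuse the communication pattern.
   PETSC_FALSE restores the default (rebuild the pattern each time). */
PetscErrorCode disable_subset_offproc(Mat A)
{
  PetscErrorCode ierr;
  ierr = MatSetOption(A, MAT_SUBSET_OFF_PROC_ENTRIES, PETSC_FALSE);CHKERRQ(ierr);
  return 0;
}
```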
--Junchao Zhang

On Sun, Mar 28, 2021 at 3:34 PM Jed Brown <[email protected]> wrote:

> I take it this was using MAT_SUBSET_OFF_PROC_ENTRIES. I implemented that
> to help performance of PHASTA and other applications that assemble
> matrices that are relatively cheap to solve (so assembly cost is
> significant compared to preconditioner setup and KSPSolve), and I'm glad
> it helps so much here.
>
> I don't have an explanation for why you're observing local vector
> operations like VecScale and VecMAXPY running over twice as fast in the
> new code. These consist of simple code that has not changed, and which is
> normally memory-bandwidth limited (though some of your problem sizes
> might fit in cache).
>
> Mohammad Gohardoust <[email protected]> writes:
>
>> Here is the plot of run time in old and new petsc using 1, 2, 4, 8, and
>> 16 CPUs (in logarithmic scale):
>>
>> [image: Screenshot from 2021-03-28 10-48-56.png]
>>
>> On Thu, Mar 25, 2021 at 12:51 PM Mohammad Gohardoust
>> <[email protected]> wrote:
>>
>>> That's right, these loops also take roughly half the time as well. If I
>>> am not mistaken, petsc (MatSetValue) is called after doing some
>>> calculations over each tetrahedral element.
>>> Thanks for your suggestion. I will try that and will post the results.
>>>
>>> Mohammad
>>>
>>> On Wed, Mar 24, 2021 at 3:23 PM Junchao Zhang <[email protected]>
>>> wrote:
>>>
>>>> On Wed, Mar 24, 2021 at 2:17 AM Mohammad Gohardoust
>>>> <[email protected]> wrote:
>>>>
>>>>> So the code itself is a finite-element scheme, and in stages 1 and 3
>>>>> there are expensive loops over the entire mesh elements which consume
>>>>> a lot of time.
>>>>
>>>> So these expensive loops must also take half the time with the newer
>>>> petsc? And these loops do not call petsc routines?
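The assembly pattern under discussion, calling MatSetValues() for each tetrahedral element and letting MatAssemblyBegin/End exchange off-process entries, typically looks like the following. This is a hedged sketch only: the function name, element-matrix fill, and index layout are assumptions, not taken from the actual application (and it needs a PETSc build to compile):

```c
#include <petscmat.h>

/* Illustrative per-element assembly for P1 tetrahedra:
   elem_nodes holds 4 global dof indices per element. */
PetscErrorCode assemble_stiffness(Mat A, PetscInt nelem, const PetscInt elem_nodes[])
{
  PetscErrorCode ierr;
  PetscScalar    Ke[16]; /* 4x4 element matrix */
  PetscInt       e, i;

  for (e = 0; e < nelem; e++) {
    const PetscInt *idx = &elem_nodes[4 * e];
    for (i = 0; i < 16; i++) Ke[i] = 0.0; /* ...element integration would fill Ke... */
    ierr = MatSetValues(A, 4, idx, 4, idx, Ke, ADD_VALUES);CHKERRQ(ierr);
  }
  /* Off-process entries are communicated here; this is the phase that
     MAT_SUBSET_OFF_PROC_ENTRIES (and the MatAssemblyBegin/End events
     in the log) affect. */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  return 0;
}
```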
>>>> I think you can build two PETSc versions with the same configuration
>>>> options, then run your code with one MPI rank to see if there is a
>>>> difference.
>>>> If they give the same performance, then scale to 2, 4, ... ranks and
>>>> see what happens.
>>>>
>>>>> Mohammad
>>>>>
>>>>> On Tue, Mar 23, 2021 at 6:08 PM Junchao Zhang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> In the new log, I saw
>>>>>>
>>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>>>>>  0:      Main Stage: 5.4095e+00   2.3%  4.3700e+03   0.0%  4.764e+05   3.0%  3.135e+02       1.0%  2.244e+04  12.6%
>>>>>>  1: Solute_Assembly: 1.3977e+02  59.4%  7.3353e+09   4.6%  3.263e+06  20.7%  1.278e+03      26.9%  1.059e+04   6.0%
>>>>>>
>>>>>> But I didn't see any event in this stage that had a cost close to
>>>>>> 140s. What happened?
>>>>>>
>>>>>> --- Event Stage 1: Solute_Assembly
>>>>>>
>>>>>> BuildTwoSided     3531 1.0 2.8025e+00 26.3 0.00e+00 0.0 3.6e+05 4.0e+00 3.5e+03  1  0  2  0  2   1  0 11  0 33     0
>>>>>> BuildTwoSidedF    3531 1.0 2.8678e+00 13.2 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  1  0  5 17  2   1  0 22 62 33     0
>>>>>> VecScatterBegin   7062 1.0 7.1911e-02  1.9 0.00e+00 0.0 7.1e+05 3.5e+02 0.0e+00  0  0  5  2  0   0  0 22  6  0     0
>>>>>> VecScatterEnd     7062 1.0 2.1248e-01  3.0 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    73
>>>>>> SFBcastOpBegin    3531 1.0 2.6516e-02  2.4 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>>> SFBcastOpEnd      3531 1.0 9.5041e-02  4.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>> SFReduceBegin     3531 1.0 3.8955e-02  2.1 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>>> SFReduceEnd       3531 1.0 1.3791e-01  3.9 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   112
>>>>>> SFPack            7062 1.0 6.5591e-03  2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>> SFUnpack          7062 1.0 7.4186e-03  2.1 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2080
>>>>>> MatAssemblyBegin  3531 1.0 4.7846e+00  1.1 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  2  0  5 17  2   3  0 22 62 33     0
>>>>>> MatAssemblyEnd    3531 1.0 1.5468e+00  2.7 1.68e+07 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  2  0  0  0   104
>>>>>> MatZeroEntries    3531 1.0 3.0998e-02  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>>
>>>>>> --Junchao Zhang
>>>>>>
>>>>>> On Tue, Mar 23, 2021 at 5:24 PM Mohammad Gohardoust
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Dave for your reply.
>>>>>>>
>>>>>>> For sure PETSc is awesome :D
>>>>>>>
>>>>>>> Yes, in both cases petsc was configured with --with-debugging=0,
>>>>>>> and fortunately I do have the old and new -log_view outputs, which
>>>>>>> I attached.
>>>>>>>
>>>>>>> Best,
>>>>>>> Mohammad
>>>>>>>
>>>>>>> On Tue, Mar 23, 2021 at 1:37 AM Dave May <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Nice to hear!
>>>>>>>> The answer is simple: PETSc is awesome :)
>>>>>>>>
>>>>>>>> Jokes aside, assuming both petsc builds were configured with
>>>>>>>> --with-debugging=0, I don't think there is a definitive answer to
>>>>>>>> your question with the information you provided.
>>>>>>>>
>>>>>>>> It could be as simple as one specific implementation you use
>>>>>>>> being improved between petsc releases. Not being an Ubuntu
>>>>>>>> expert, the change might be associated with using a different
>>>>>>>> compiler, and/or a more efficient BLAS implementation
>>>>>>>> (non-threaded vs threaded). However, I doubt this is the origin
>>>>>>>> of your 2x performance increase.
>>>>>>>>
>>>>>>>> If you really want to understand where the performance
>>>>>>>> improvement originated from, you'd need to send to the email list
>>>>>>>> the result of -log_view from both the old and new versions,
>>>>>>>> running the exact same problem.
>>>>>>>> From that info, we can see what implementations in PETSc are
>>>>>>>> being used and where the time reduction is occurring. Knowing
>>>>>>>> that, it should be clearer to provide an explanation for it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Tue 23. Mar 2021 at 06:24, Mohammad Gohardoust
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using a code which is based on petsc (and also parmetis).
>>>>>>>>> Recently I made the following changes, and now the code is
>>>>>>>>> running about two times faster than before:
>>>>>>>>>
>>>>>>>>> - Upgraded Ubuntu 18.04 to 20.04
>>>>>>>>> - Upgraded petsc 3.13.4 to 3.14.5
>>>>>>>>> - This time I installed parmetis and metis directly via petsc
>>>>>>>>>   with the --download-parmetis --download-metis flags instead of
>>>>>>>>>   installing them separately and using
>>>>>>>>>   --with-parmetis-include=... and --with-parmetis-lib=... (the
>>>>>>>>>   version of installed parmetis was 4.0.3 before)
>>>>>>>>>
>>>>>>>>> I was wondering what can possibly explain this speedup? Does
>>>>>>>>> anyone have any suggestions?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mohammad
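The suggestions in the thread (identical optimized configure options, same problem, profiling from one rank upward) combine into a short recipe. A hedged sketch: the application binary `./app` and the optimization levels are placeholders, and the application must be rebuilt against each PETSc tree before comparing.

```shell
# Configure each PETSc tree the same way, using the flags mentioned
# in the thread (optimization levels here are illustrative).
./configure --with-debugging=0 \
            --download-metis --download-parmetis \
            COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2

# After rebuilding the application against this tree, run the exact
# same problem at increasing rank counts with profiling enabled.
for n in 1 2 4 8 16; do
  mpiexec -n "$n" ./app -log_view > "np${n}.log"
done
```

Diffing the per-event times in the resulting `-log_view` outputs between the two builds shows where the speedup actually occurs.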
