I take it this was using MAT_SUBSET_OFF_PROC_ENTRIES. I implemented that to 
help performance of PHASTA and other applications that assemble matrices that 
are relatively cheap to solve (so assembly cost is significant compared to 
preconditioner setup and KSPSolve) and I'm glad it helps so much here.
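
For anyone reading along, here is a minimal sketch of how an application opts in. The function name AssembleStep and the surrounding structure are hypothetical; the assumption is an already-preallocated MPIAIJ matrix A that is reassembled every time step, with 3.14-era ierr/CHKERRQ error handling:

```c
#include <petscmat.h>

/* Sketch only: enable MAT_SUBSET_OFF_PROC_ENTRIES on a matrix that is
   assembled repeatedly.  The option promises that every assembly after
   the first communicates only a subset of the off-process entries seen
   in the first one, so the stash communication pattern can be reused
   instead of being rebuilt on each assembly. */
PetscErrorCode AssembleStep(Mat A)
{
  PetscErrorCode ierr;

  ierr = MatSetOption(A, MAT_SUBSET_OFF_PROC_ENTRIES, PETSC_TRUE);CHKERRQ(ierr);
  /* ... per-element MatSetValues() calls go here ... */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  return 0;
}
```

The option must be set before the assembly whose pattern you want reused; it is safe to set it once right after matrix creation.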

I don't have an explanation for why you're observing local vector operations 
like VecScale and VecMAXPY running over twice as fast in the new code. These 
are simple kernels that have not changed and are normally memory-bandwidth 
limited (though some of your problem sizes might fit in cache).  

Mohammad Gohardoust <[email protected]> writes:

> Here is the plot of run time in old and new petsc using 1,2,4,8, and 16
> CPUs (in logarithmic scale):
>
> [image: Screenshot from 2021-03-28 10-48-56.png]
>
>
>
>
> On Thu, Mar 25, 2021 at 12:51 PM Mohammad Gohardoust <[email protected]>
> wrote:
>
>> That's right, these loops also take roughly half the time. If I am not
>> mistaken, petsc (MatSetValue) is called after doing some calculations over
>> each tetrahedral element.
>> Thanks for your suggestion. I will try that and will post the results.
>>
>> Mohammad
>>
>> On Wed, Mar 24, 2021 at 3:23 PM Junchao Zhang <[email protected]>
>> wrote:
>>
>>>
>>>
>>>
>>> On Wed, Mar 24, 2021 at 2:17 AM Mohammad Gohardoust <[email protected]>
>>> wrote:
>>>
>>>> So the code itself is a finite-element scheme and in stage 1 and 3 there
>>>> are expensive loops over entire mesh elements which consume a lot of time.
>>>>
>>> So these expensive loops also take half the time with the newer petsc?
>>> And these loops do not call petsc routines?
>>> I think you can build two PETSc versions with the same configuration
>>> options, then run your code with one MPI rank to see if there is a
>>> difference.
>>> If they give the same performance, then scale to 2, 4, ... ranks and see
>>> what happens.
>>>
>>>
>>>
>>>>
>>>> Mohammad
>>>>
>>>> On Tue, Mar 23, 2021 at 6:08 PM Junchao Zhang <[email protected]>
>>>> wrote:
>>>>
>>>>> In the new log, I saw
>>>>>
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>>>  0:      Main Stage: 5.4095e+00   2.3%  4.3700e+03   0.0%  4.764e+05   3.0%  3.135e+02        1.0%  2.244e+04  12.6%
>>>>>  1: Solute_Assembly: 1.3977e+02  59.4%  7.3353e+09   4.6%  3.263e+06  20.7%  1.278e+03       26.9%  1.059e+04   6.0%
>>>>>
>>>>>
>>>>> But I didn't see any event in this stage with a cost close to 140s.
>>>>> What happened?
>>>>>
>>>>>  --- Event Stage 1: Solute_Assembly
>>>>>
>>>>> BuildTwoSided       3531 1.0 2.8025e+0026.3 0.00e+00 0.0 3.6e+05 4.0e+00 3.5e+03  1  0  2  0  2   1  0 11  0 33     0
>>>>> BuildTwoSidedF      3531 1.0 2.8678e+0013.2 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  1  0  5 17  2   1  0 22 62 33     0
>>>>> VecScatterBegin     7062 1.0 7.1911e-02 1.9 0.00e+00 0.0 7.1e+05 3.5e+02 0.0e+00  0  0  5  2  0   0  0 22  6  0     0
>>>>> VecScatterEnd       7062 1.0 2.1248e-01 3.0 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    73
>>>>> SFBcastOpBegin      3531 1.0 2.6516e-02 2.4 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>> SFBcastOpEnd        3531 1.0 9.5041e-02 4.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFReduceBegin       3531 1.0 3.8955e-02 2.1 0.00e+00 0.0 3.6e+05 3.5e+02 0.0e+00  0  0  2  1  0   0  0 11  3  0     0
>>>>> SFReduceEnd         3531 1.0 1.3791e-01 3.9 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   112
>>>>> SFPack              7062 1.0 6.5591e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFUnpack            7062 1.0 7.4186e-03 2.1 1.60e+06 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2080
>>>>> MatAssemblyBegin    3531 1.0 4.7846e+00 1.1 0.00e+00 0.0 7.1e+05 3.6e+03 3.5e+03  2  0  5 17  2   3  0 22 62 33     0
>>>>> MatAssemblyEnd      3531 1.0 1.5468e+00 2.7 1.68e+07 2.7 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  2  0  0  0   104
>>>>> MatZeroEntries      3531 1.0 3.0998e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>>
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 23, 2021 at 5:24 PM Mohammad Gohardoust <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Dave for your reply.
>>>>>>
>>>>>> For sure PETSc is awesome :D
>>>>>>
>>>>>> Yes, in both cases petsc was configured with --with-debugging=0 and
>>>>>> fortunately I do have the old and new -log_view outputs, which I attached.
>>>>>>
>>>>>> Best,
>>>>>> Mohammad
>>>>>>
>>>>>> On Tue, Mar 23, 2021 at 1:37 AM Dave May <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Nice to hear!
>>>>>>> The answer is simple, PETSc is awesome :)
>>>>>>>
>>>>>>> Jokes aside, assuming both petsc builds were configured with
>>>>>>> --with-debugging=0, I don't think there is a definitive answer to your
>>>>>>> question with the information you provided.
>>>>>>>
>>>>>>> It could be as simple as one specific implementation you use having been
>>>>>>> improved between petsc releases. Not being an Ubuntu expert, the change
>>>>>>> might be associated with using a different compiler, and/or a more
>>>>>>> efficient BLAS implementation (non-threaded vs threaded). However, I doubt
>>>>>>> this is the origin of your 2x performance increase.
>>>>>>>
>>>>>>> If you really want to understand where the performance improvement
>>>>>>> originated from, you’d need to send to the email list the result of
>>>>>>> -log_view from both the old and new versions, running the exact same
>>>>>>> problem.
>>>>>>>
>>>>>>> From that info, we can see what implementations in PETSc are being
>>>>>>> used and where the time reduction is occurring. Knowing that, it should 
>>>>>>> be
>>>>>>> clearer to provide an explanation for it.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>> On Tue 23. Mar 2021 at 06:24, Mohammad Gohardoust <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using a code which is based on petsc (and also parmetis).
>>>>>>>> Recently I made the following changes and now the code is running 
>>>>>>>> about two
>>>>>>>> times faster than before:
>>>>>>>>
>>>>>>>>    - Upgraded Ubuntu 18.04 to 20.04
>>>>>>>>    - Upgraded petsc 3.13.4 to 3.14.5
>>>>>>>>    - This time I installed parmetis and metis directly via petsc by
>>>>>>>>    --download-parmetis --download-metis flags instead of installing 
>>>>>>>> them
>>>>>>>>    separately and using --with-parmetis-include=... and
>>>>>>>>    --with-parmetis-lib=... (the version of installed parmetis was 
>>>>>>>> 4.0.3 before)
>>>>>>>>
>>>>>>>> I was wondering what can possibly explain this speedup? Does anyone
>>>>>>>> have any suggestions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mohammad
>>>>>>>>
>>>>>>>
