The flop rate of 123 megaflops is absurdly small; something unexpected must 
be happening. Any chance you can send me a small code that reproduces the behavior below?
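
Something as small as this would do (a hypothetical sketch, with two 
tridiagonal AIJ matrices standing in for yours):

from petsc4py import PETSc

n = 5000
A = PETSc.Mat().createAIJ([n, n], nnz=(3, 3), comm=PETSc.COMM_WORLD)
B = PETSc.Mat().createAIJ([n, n], nnz=(3, 3), comm=PETSc.COMM_WORLD)
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    cols = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
    vals = [1.0] * len(cols)
    A.setValues(i, cols, vals)
    B.setValues(i, cols, vals)
A.assemble()
B.assemble()

# The operation in question; run with -log_view to get the Mflop/s figure.
A.axpy(2.0, B, PETSc.Mat.Structure.SAME_NONZERO_PATTERN)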


> On Nov 11, 2023, at 9:19 PM, Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
> 
> I've isolated the profiling to run only during the addition of the two 
> matrices. The output is:
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> MatAssemblyBegin       1 1.0 4.5000e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 7.4370e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
> MatAXPY                1 1.0 6.6962e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0 100 100  0  0  0   123
> ------------------------------------------------------------------------------------------------------------------------
> 
> On Sat, Nov 11, 2023 at 6:20 PM Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>> My apologies, I missed your other question.
>> 
>> Yes, I imagine most of those setValue and getRow calls are due to the 
>> construction of the matrices in question. I ended up writing my own 
>> "semi-parallel" Kronecker product function, which involves setting and getting 
>> a lot of values. I imagine I should optimize that much more, but I consider 
>> it an upfront cost before time propagation, so I haven't touched it since I got 
>> it working. 
>> 
>> Thank you for the quick replies
>> 
>> 
>> 
>> On Sat, Nov 11, 2023 at 6:12 PM Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>>> My apologies, that was run with 7 cores; this is the result for n = 1.
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSided         13 1.0 1.4770e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BuildTwoSidedF        13 1.0 2.3150e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecView               24 1.0 3.1525e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecCopy               12 1.0 4.1000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyBegin      13 1.0 5.1290e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd        13 1.0 7.0200e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSetRandom           2 1.0 6.2500e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult              112 1.0 1.7071e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8268
>>> MatSolve             112 1.0 8.3160e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  8486
>>> MatLUFactorSym         2 1.0 1.2660e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatLUFactorNum         2 1.0 2.2895e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  4646
>>> MatScale               2 1.0 4.4903e-02 1.0 1.65e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 61  0  0  0   0 61  0  0  0  3671
>>> MatAssemblyBegin     169 1.0 4.5040e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd       169 1.0 2.2126e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 40  0  0  0  0  40  0  0  0  0     0
>>> MatSetValues        9078 1.0 6.1440e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0   1 31  0  0  0   134
>>> MatGetRow           9078 1.0 5.8936e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetRowIJ            2 1.0 9.0900e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatGetOrdering         2 1.0 3.6880e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAXPY                1 1.0 6.4112e-01 1.0 8.24e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 31  0  0  0   1 31  0  0  0   129
>>> PCSetUp                2 1.0 4.6308e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2297
>>> PCApply              112 1.0 8.4632e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  8339
>>> KSPSetUp               2 1.0 4.5000e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve             112 1.0 8.9241e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7908
>>> EPSSetUp               2 1.0 6.5904e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1614
>>> EPSSolve               2 1.0 4.4982e-03 1.0 2.11e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  4687
>>> STSetUp                2 1.0 5.0441e-04 1.0 1.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  2109
>>> STComputeOperatr       2 1.0 2.5700e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> STApply              112 1.0 1.6852e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8375
>>> STMatSolve           112 1.0 9.1930e-04 1.0 7.06e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7677
>>> BVCopy                20 1.0 1.7930e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BVMultVec            219 1.0 2.3467e-04 1.0 2.31e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  9861
>>> BVMultInPlace         12 1.0 1.1088e-04 1.0 1.09e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  9857
>>> BVDotVec             219 1.0 2.9956e-04 1.0 2.47e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  8245
>>> BVOrthogonalizeV     114 1.0 5.8952e-04 1.0 4.82e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  8181
>>> BVScale              114 1.0 2.9150e-05 1.0 4.06e+04 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1392
>>> BVSetRandom            2 1.0 9.6900e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> BVMatMultVec         112 1.0 1.7464e-03 1.0 1.41e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  8082
>>> DSSolve               10 1.0 9.1354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> DSVectors             24 1.0 1.1908e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> DSOther               30 1.0 1.2798e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> On Sat, Nov 11, 2023 at 5:59 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>> 
>>>>    How many MPI processes did you use? Please try with just one to get 
>>>> a baseline.
>>>> 
>>>> MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
>>>> MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
>>>> 
>>>>    This is concerning: there are 1298 calls to MatSetValues and MatGetRow. 
>>>> Are you calling them? If not, that means MatAXPY is calling them (when 
>>>> SAME_NONZERO_PATTERN is not used, they are used inside MatAXPY). 
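>>>> 
>>>>    In petsc4py the pattern flag is the third argument to axpy. A minimal 
>>>> sketch of the fast call (the matrix and coefficient names are placeholders 
>>>> taken from your snippet below):
>>>> 
>>>> pattern = PETSc.Mat.Structure.SAME_NONZERO_PATTERN
>>>> S.axpy(-Field.pulse[i], H_mix, pattern)  # BLAS path; no MatGetRow/MatSetValues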
>>>> 
>>>>> I'm still somewhat new to parallel computing, so I'm not sure what you 
>>>>> mean by binding. Does this lock in certain processes to certain cores? 
>>>> 
>>>> 
>>>>    Yes, it is usually the right thing to do for numerical computing. I 
>>>> included a link to some discussion of it on petsc.org.
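>>>> 
>>>>    For example (a sketch only; the exact flag spelling depends on your MPI 
>>>> implementation, this is MPICH-style, and your_script.py is a placeholder):
>>>> 
>>>> mpiexec -n 7 -bind-to core python your_script.py -log_view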
>>>> 
>>>>> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>>>>> 
>>>>> Hello again,
>>>>> 
>>>>> I've run the simulation with profiling again. In this setup I ran only 
>>>>> the methods necessary to construct the matrices, and then at the end I 
>>>>> added H_mix and H_ang using MatAXPY. Below are the results:
>>>>> 
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> 
>>>>> --- Event Stage 0: Main Stage
>>>>> 
>>>>> BuildTwoSided        350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02  3  0 24  0 22   3  0 24  0 22     0
>>>>> BuildTwoSidedF       236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02  3  0 14  0 15   3  0 14  0 15     0
>>>>> SFSetGraph           114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFSetUp              116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02  0  0 38 71  7   0  0 38 71  7     0
>>>>> SFPack               312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> SFUnpack             312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecView               26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecCopy               13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecAssemblyBegin      14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01  0  0 14  0  1   0  0 14  0  1     0
>>>>> VecAssemblyEnd        14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecScatterBegin      312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02  0  0 45 29  6   0  0 45 29  7     0
>>>>> VecScatterEnd        312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> VecSetRandom           2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatMult              104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> MatSolve             104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
>>>>> MatLUFactorSym         2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatLUFactorNum         2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatScale               2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 65  0  0  0   0 65  0  0  0  3657
>>>>> MatAssemblyBegin     169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02  2  0  0  0  7   2  0  0  0  7     0
>>>>> MatAssemblyEnd       169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22  0 36 71 48  22  0 36 71 49     0
>>>>> MatSetValues        1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
>>>>> MatGetRow           1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> MatAXPY                1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
>>>>> PCSetUp                2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
>>>>> PCApply              104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
>>>>> KSPSetUp               2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> KSPSolve             104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
>>>>> EPSSetUp               2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  1   0  0  0  0  1     0
>>>>> EPSSolve               2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02  0  3 49 29 28   0  3 49 29 28   242
>>>>> STSetUp                2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
>>>>> STComputeOperatr       2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> STApply              104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> STMatSolve           104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
>>>>> BVCopy                20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> BVMultVec            206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 11442
>>>>> BVMultInPlace         11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 11026
>>>>> BVDotVec             206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  1  0  0 13   0  1  0  0 13  2205
>>>>> BVOrthogonalizeV     106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  2  0  0 13   0  2  0  0 13  3706
>>>>> BVScale              106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   418
>>>>> BVSetRandom            2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> BVMatMultVec         104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
>>>>> DSSolve                9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> DSVectors             24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> DSOther               27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> 
>>>>> I believe the only matrix addition I did is shown there, taking about 
>>>>> 0.17 seconds. Doing that twice per timestep, in addition to the scaling, 
>>>>> comes to about 0.4 seconds per timestep during the full calculation, so 
>>>>> the timestep is far too slow.
>>>>> 
>>>>> However, it seems that the number of flops is on the order of 10^7. Would 
>>>>> this imply that I am perhaps memory-bandwidth bound on my home PC?
>>>>> 
>>>>> I'm still somewhat new to parallel computing, so I'm not sure what you 
>>>>> mean by binding. Does it lock certain processes to certain cores? 
>>>>> 
>>>>> Thank you for your time and assistance
>>>>> 
>>>>> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>> 
>>>>>>   Here is the code for SeqBAIJ with SAME_NONZERO_PATTERN (the other 
>>>>>> formats are similar):
>>>>>> 
>>>>>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>>>>>> {
>>>>>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>>>>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>>>>>   PetscBLASInt one = 1;
>>>>>> ....
>>>>>>   if (str == SAME_NONZERO_PATTERN) {
>>>>>>     PetscScalar  alpha = a;
>>>>>>     PetscBLASInt bnz;
>>>>>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>>>>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>>>>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>>>>>> 
>>>>>> It directly adds the nonzero values of the two matrices together (the 
>>>>>> nonzero structure plays no role), and it uses the BLAS, so it should 
>>>>>> perform as "fast as possible" given the hardware. (Did you configure 
>>>>>> with --with-debugging=0? Are you using binding with mpiexec to ensure 
>>>>>> MPI is using the best combination of cores? See 
>>>>>> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup)
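>>>>>> 
>>>>>>   For reference, an optimized build is configured along these lines (the 
>>>>>> optimization flags here are illustrative, not a prescription):
>>>>>> 
>>>>>> ./configure --with-debugging=0 COPTFLAGS='-O3' CXXOPTFLAGS='-O3'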
>>>>>> 
>>>>>>  Can you run with -log_view and send the output so we can see the 
>>>>>> performance directly?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>>>>>>> 
>>>>>>> Hello, and thank you for the quick reply.
>>>>>>> 
>>>>>>> For some context, all of these matrices are produced from a Kronecker 
>>>>>>> product, computed in parallel, between an "angular" and a "radial" 
>>>>>>> matrix. In this case S_total and H_atom use the same angular matrix, 
>>>>>>> which is the identity, while their radial matrices have different 
>>>>>>> sparsity patterns.
>>>>>>> 
>>>>>>>  Meanwhile, H_mix and H_ang are constructed from two different angular 
>>>>>>> matrices; however, both are nonzero only along the two off-diagonals, 
>>>>>>> and their radial matrices have the same sparse structure. 
>>>>>>> 
>>>>>>> So S_total and H_atom have the same block-sparse structure, while H_mix 
>>>>>>> and H_ang have the same block-sparse structure. However, even just 
>>>>>>> adding H_mix and H_ang in the simplest case takes around half a second, 
>>>>>>> which is still too slow, even when using SAME_NONZERO_PATTERN. 
>>>>>>> 
>>>>>>> This confuses me the most because another researcher wrote a similar 
>>>>>>> code a few years back using finite differences, so his matrices are on 
>>>>>>> the order of 250k x 250k and larger, yet his matrix additions are far 
>>>>>>> faster. 
>>>>>>> 
>>>>>>> If you would like, I can show more code snippets, but I'm not sure what 
>>>>>>> you would want to see.
>>>>>>> 
>>>>>>> Thank you for your time
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  DIFFERENT_NONZERO_PATTERN will always be slow because it needs to 
>>>>>>>> determine the nonzero pattern of the result for each computation.
>>>>>>>> 
>>>>>>>>  SAME_NONZERO_PATTERN can obviously run fast but may not be practical 
>>>>>>>> for your problem. 
>>>>>>>> 
>>>>>>>>  SUBSET_NONZERO_PATTERN (which means in A += b*B we know that B has NO 
>>>>>>>> nonzero locations that A does not have) does not need to reallocate 
>>>>>>>> anything in A so it can be reasonably fast (we can do more 
>>>>>>>> optimizations on the current code to improve it for you).
>>>>>>>> 
>>>>>>>>   So, can you construct the S nonzero structure initially to have the 
>>>>>>>> union of the nonzero structures of all the matrices that accumulate 
>>>>>>>> into it? This is the same nonzero structure that it "ends up with" in 
>>>>>>>> the current code. With this nonzero structure, SUBSET_NONZERO_PATTERN 
>>>>>>>> can be used. Depending on how sparse all the matrices that get 
>>>>>>>> accumulated into S are, you could perhaps go further and make sure 
>>>>>>>> they all have the same nonzero structure (sure, it uses more memory 
>>>>>>>> and will do extra operations, but the actual computation will be so 
>>>>>>>> fast it may be worth the extra computations).
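>>>>>>>> 
>>>>>>>>   A minimal sketch of that setup in petsc4py (matrix names follow the 
>>>>>>>> code at the bottom of this thread; this assumes MatAXPY with a zero 
>>>>>>>> coefficient still merges the nonzero pattern, so if your version 
>>>>>>>> special-cases a == 0, insert explicit zeros instead):
>>>>>>>> 
>>>>>>>> # One-time setup: pay the slow, pattern-merging additions once so that
>>>>>>>> # S ends up holding the union of all the nonzero structures.
>>>>>>>> diff = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>>>>> for M in (H_mix, H_ang, H_atom):
>>>>>>>>     S.axpy(0.0, M, diff)  # adds zeros; only the pattern of S changes
>>>>>>>> 
>>>>>>>> # Every timestep: each pattern is now a subset of S's pattern, so
>>>>>>>> # nothing is reallocated and the additions can be reasonably fast.
>>>>>>>> subset = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
>>>>>>>> S.axpy(-Field.pulse[i], H_mix, subset)
>>>>>>>> S.axpy(-Field.pulse[i], H_ang, subset)
>>>>>>>> S.axpy(-1.0, H_atom, subset)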
>>>>>>>> 
>>>>>>>>   Barry
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <donald.plan...@colorado.edu> wrote:
>>>>>>>>> 
>>>>>>>>> Hello,
>>>>>>>>> 
>>>>>>>>> I am trying to use petsc4py to conduct a quantum mechanics 
>>>>>>>>> simulation. I've been able to construct all of the relevant matrices; 
>>>>>>>>> however, I am running into a gigantic bottleneck. 
>>>>>>>>> 
>>>>>>>>> For the simplest problem I am running, I have a few matrices, each 
>>>>>>>>> about 5000x5000. In order to begin time propagation I need to add 
>>>>>>>>> these matrices together. However, on 6 cores of my local machine this 
>>>>>>>>> takes approximately 1-2 seconds per addition. Since I need to do this 
>>>>>>>>> for each time step of my simulation, it is prohibitively slow: there 
>>>>>>>>> could be upwards of 10K time steps. 
>>>>>>>>> 
>>>>>>>>> Below is the relevant code:
>>>>>>>>> 
>>>>>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>>>>>> if test2:
>>>>>>>>>     def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>>>>>         S.axpy(-Field.pulse[i], MIX, structure)
>>>>>>>>>         S.axpy(-Field.pulse[i], ANG, structure)
>>>>>>>>>         S.axpy(-1, ATOM, structure)
>>>>>>>>>         return S
>>>>>>>>> 
>>>>>>>>>     def makeRight(S, MIX, ANG, ATOM, i):
>>>>>>>>>         S.axpy(Field.pulse[i], MIX, structure)
>>>>>>>>>         S.axpy(Field.pulse[i], ANG, structure)
>>>>>>>>>         S.axpy(1, ATOM, structure)
>>>>>>>>>         return S
>>>>>>>>> 
>>>>>>>>>     H_mix = Int.H_mix
>>>>>>>>>     H_mix.scale(1j * dt / 2)
>>>>>>>>> 
>>>>>>>>>     H_ang = Int.H_ang
>>>>>>>>>     H_ang.scale(1j * dt / 2)
>>>>>>>>> 
>>>>>>>>>     H_atom = Int.H_atom
>>>>>>>>>     H_atom.scale(1j * dt / 2)
>>>>>>>>> 
>>>>>>>>>     S = Int.S_total
>>>>>>>>> 
>>>>>>>>>     psi_initial = psi.psi_initial.copy()
>>>>>>>>>     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>>>>>> 
>>>>>>>>>     for i, t in enumerate(box.t):
>>>>>>>>>         print(i, L)
>>>>>>>>> 
>>>>>>>>>         # Build the left- and right-hand operators for this timestep
>>>>>>>>>         O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>>>>>         O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>>>>>> 
>>>>>>>>>         if i == 0:
>>>>>>>>>             known = O_R.getVecRight()
>>>>>>>>>             sol = O_L.getVecRight()
>>>>>>>>> 
>>>>>>>>>         O_R.mult(psi_initial, known)
>>>>>>>>>         ksp.setOperators(O_L)
>>>>>>>>>         ksp.solve(known, sol)
>>>>>>>>> 
>>>>>>>>>         # Copy the new solution into psi_initial for the next step
>>>>>>>>>         sol.copy(psi_initial)
>>>>>>>>> 
>>>>>>>>> I need to clean it up a bit, but the main point is that I need to add 
>>>>>>>>> the matrices many times for a single time step. I can't preallocate 
>>>>>>>>> memory very well since some of the matrices aren't the most sparse, 
>>>>>>>>> either. It seems that if I cannot speed up the addition it will be 
>>>>>>>>> difficult to continue, so I was wondering if you had any insights.
>>>>>>>>> 
>>>>>>>>> Thank you for your time
>>>>>>>> 
>>>>>> 
>>>> 
