On Thu, Jun 10, 2021 at 5:55 PM edgar <edgar...@cryptolab.net> wrote:
> On 2021-06-10 19:27, John Peterson wrote:
> > I recorded the "Active time" for the "Matrix Assembly Performance"
> > PerfLog in introduction_ex4 running "./example-opt -d 3 -n 40" for
> > both the original codepath and your proposed change, averaging the
> > results over 5 runs. The results were:
> >
> > Original code, "./example-opt -d 3 -n 40"
> > import numpy as np
> > np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93
> >
> > Patch, "./example-opt -d 3 -n 40"
> > import numpy as np
> > np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00
> >
> > so I'd say the original code path is marginally (but still
> > statistically significantly) faster, although keep in mind that
> > matrix assembly is only about 21% of the total time for this example
> > while the solve is about 71%.
>
> Super interesting; I am sending you my benchmarks. I must say that I
> had initially run only 2 benchmarks, and both came out faster with the
> modifications. Now, I found that:
> - The original code is more efficient with `-n 40'.
> - The modified code is more efficient with `-n 15' and `mpirun -np 4'.
> - When I ran the 5-test trial several times, the original code was
>   sometimes more efficient with `-n 15', but the first and second runs
>   with the modified code were always faster (my computer heating up?).
>
> The gains are really marginal in any case. It would be interesting to
> run with -O3... (I just did [1]). It seems that the differences are
> now a little more substantial, and that the modified code would be
> faster. I hope not to have made any mistakes.
>
> The code and the benchmarks are in the attached file:
> - examples
>   |- introduction
>      |- ex4 (original code)
>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>      |- ex4_mod (modified code)
>         |- output_*_.txt.bz2 (running -n 40 with -O2)
>         |- output_15_*_.txt.bz2 (running -n 15 with -O2)
>         |- output_40_O3_*_.txt.bz2 (running -n 40 with -O3)
>
> [1] I manually compiled like this (added -O3 instead of -O2; disregard
> the CCFLAGS et al.):
>
> $ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3

Your compiler flags are definitely far more advanced/aggressive than
mine, which are just the default of -O2. However, I think what we should
conclude from your results is that there is something slower than it
needs to be with DenseMatrix::resize(), not that we should move the
DenseMatrix creation/destruction inside the loop over elements.

What I tried (see the attached patch, or the
"dense_matrix_resize_no_virtual" branch in my fork) is avoiding the
virtual function call to DenseMatrix::zero() which is currently made
from DenseMatrix::resize(). In my testing, this change did not seem to
make much of a difference, but I'm curious about what you would get with
your compiler args, this patch, and the unpatched ex4.

-- 
John

_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users