On Thu, Jun 10, 2021 at 5:55 PM edgar <edgar...@cryptolab.net> wrote:

> On 2021-06-10 19:27, John Peterson wrote:
> > I recorded the "Active time" for the "Matrix Assembly Performance"
> > PerfLog in introduction_ex4 running "./example-opt -d 3 -n 40" for both
> > the original codepath and your proposed change, averaging the results
> > over 5 runs. The results were:
> >
> > Original code, "./example-opt -d 3 -n 40"
> > import numpy as np
> > np.mean([3.91801, 3.93206, 3.94358, 3.97729, 3.90512]) = 3.93
> >
> > Patch, "./example-opt -d 3 -n 40"
> > import numpy as np
> > np.mean([4.10462, 4.06232, 3.95176, 3.92786, 3.97992]) = 4.00
> >
> > so I'd say the original code path is marginally (but still statistically
> > significantly) faster, although keep in mind that matrix assembly is only
> > about 21% of the total time for this example while the solve is about 71%.
>
> Super interesting, I am sending you my benchmarks. I must say that I had
> initially run only 2 benchmarks, and both came out faster with the
> modifications. Now, I found that
> - The original code is more efficient with `-n 40'
> - The modified code is more efficient with `-n 15' and `mpirun -np 4'
> - When I ran the 5-test trial several times, the original code was
> sometimes more efficient with `-n 15', but the first and second runs with
> the modified code were always faster (my computer heating up?)
>
> The gains are really marginal in any case. It would be interesting to
> run with -O3... (I just did [1]). It seems that the differences are now
> a little more substantial, and that the modified code is faster. I hope
> I have not made any mistakes.
>
> The code and the benchmarks are in the attached file.
> - examples
>  |- introduction
>   |- ex4                          (original code)
>    |- output_*_.txt.bz2           (running -n 40 with -O2)
>    |- output_15_*_.txt.bz2        (running -n 15 with -O2)
>    |- output_40_O3_*_.txt.bz2     (running -n 40 with -O3)
>   |- ex4_mod                      (modified code)
>    |- output_*_.txt.bz2           (running -n 40 with -O2)
>    |- output_15_*_.txt.bz2        (running -n 15 with -O2)
>    |- output_40_O3_*_.txt.bz2     (running -n 40 with -O3)
>
>
> [1] I manually compiled like this (added -O3 instead of -O2; disregard
> the CCFLAGS et al):
>
>      $ mpicxx -std=gnu++17 -DNDEBUG -march=amdfam10 -O3
>
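
For reference, a minimal toy sketch of the two assembly patterns being
benchmarked (original: one element matrix kept outside the element loop and
resize()d per element; modified: the matrix constructed and destroyed inside
the loop). The ToyDenseMatrix class and assemble_* functions below are
invented stand-ins, not the actual libMesh or introduction_ex4 source:

    // Invented stand-in for an element matrix; not libMesh's DenseMatrix.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct ToyDenseMatrix
    {
      std::vector<double> vals;

      void resize (std::size_t m, std::size_t n)
      {
        vals.resize(m * n);
        std::fill(vals.begin(), vals.end(), 0.);  // zeroed on every resize
      }
    };

    // "Original" pattern: one matrix reused and resize()d for every element.
    void assemble_original (std::size_t n_elem, std::size_t n_dofs)
    {
      ToyDenseMatrix Ke;
      for (std::size_t e = 0; e != n_elem; ++e)
        {
          Ke.resize(n_dofs, n_dofs);
          // ... fill Ke and add it into the global matrix ...
        }
    }

    // "Modified" pattern: the matrix is constructed/destroyed inside the loop.
    void assemble_modified (std::size_t n_elem, std::size_t n_dofs)
    {
      for (std::size_t e = 0; e != n_elem; ++e)
        {
          ToyDenseMatrix Ke;  // constructed fresh each iteration
          Ke.resize(n_dofs, n_dofs);
          // ... fill Ke and add it into the global matrix ...
        }
    }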


Your compiler flags are definitely far more advanced/aggressive than mine,
which are just the default -O2. However, I think what we should conclude
from your results is that there is something slower than it needs to be in
DenseMatrix::resize(), not that we should move the DenseMatrix
creation/destruction inside the loop over elements. What I tried (see the
attached patch or the "dense_matrix_resize_no_virtual" branch in my fork)
is avoiding the virtual function call to DenseMatrix::zero(), which is
currently made from DenseMatrix::resize(). In my testing, this change did
not seem to make much of a difference, but I'm curious what you would get
with your compiler flags, this patch, and the unpatched ex4.
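
A minimal toy sketch of the idea (the classes below are invented stand-ins,
not libMesh's actual DenseMatrixBase/DenseMatrix): resize() currently ends
with a call to the virtual zero(), while the patched variant zeroes the new
storage directly with a non-virtual call the optimizer can see through.

    // Invented stand-ins to illustrate the virtual call; not libMesh's code.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct ToyDenseMatrixBase
    {
      virtual ~ToyDenseMatrixBase () = default;
      virtual void zero () = 0;   // virtual in the base class
    };

    struct ToyDenseMatrix : public ToyDenseMatrixBase
    {
      std::vector<double> vals;

      void zero () override
      { std::fill(vals.begin(), vals.end(), 0.); }

      // Current behavior: resize() finishes with a virtual dispatch to zero().
      void resize_with_virtual (std::size_t m, std::size_t n)
      {
        vals.resize(m * n);
        this->zero();                             // virtual call
      }

      // Patched idea: zero the storage directly, with no virtual call.
      void resize_no_virtual (std::size_t m, std::size_t n)
      {
        vals.resize(m * n);
        std::fill(vals.begin(), vals.end(), 0.);  // non-virtual, inlinable
      }
    };

Whether this helps the real resize() presumably depends on how well the
compiler can already devirtualize that call at the optimization levels
being compared.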

-- 
John

_______________________________________________
Libmesh-users mailing list
Libmesh-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-users
