https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600

--- Comment #11 from Jerry DeLisle <jvdelisle at gcc dot gnu.org> ---
(In reply to Jerry DeLisle from comment #8)
> Created attachment 36887 [details]
> A faster version
> 
> I took the example code found in
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/ where the register based
> vector computations are explicitly called via the SSE registers and
> converted it to use the builtin gcc vector extensions.  I had to experiment
> a little to get some of the equivalent operations of the original code.
> 
> With only -O2 and march=native I am getting good results. I need to roll
> this into the other test program yet to confirm the gflops are being
> computed correctly.  The diff value is comparing to the reference naive
> results to check the computation is correct.
> 
> MY_MMult = [
> Size: 40, Gflops: 1.828571e+00, Diff: 2.664535e-15 
> Size: 80, Gflops: 3.696751e+00, Diff: 7.105427e-15 
> Size: 120, Gflops: 4.051583e+00, Diff: 1.065814e-14 
> Size: 160, Gflops: 4.015686e+00, Diff: 1.421085e-14 
> Size: 200, Gflops: 4.029212e+00, Diff: 2.131628e-14 
> Size: 240, Gflops: 3.972414e+00, Diff: 2.486900e-14 
> Size: 280, Gflops: 3.881188e+00, Diff: 2.842171e-14 
> Size: 320, Gflops: 3.872371e+00, Diff: 3.552714e-14 
> Size: 360, Gflops: 3.887676e+00, Diff: 4.973799e-14 
> Size: 400, Gflops: 3.862052e+00, Diff: 4.973799e-14 
> Size: 440, Gflops: 3.886575e+00, Diff: 4.973799e-14 
> Size: 480, Gflops: 3.910124e+00, Diff: 6.039613e-14 
> Size: 520, Gflops: 3.863706e+00, Diff: 6.394885e-14 
> Size: 560, Gflops: 3.976947e+00, Diff: 6.750156e-14 
> Size: 600, Gflops: 4.002631e+00, Diff: 7.460699e-14 
> Size: 640, Gflops: 3.992507e+00, Diff: 8.171241e-14 
> Size: 680, Gflops: 3.964570e+00, Diff: 9.237056e-14 
> Size: 720, Gflops: 3.973661e+00, Diff: 1.101341e-13 
> Size: 760, Gflops: 3.982346e+00, Diff: 1.065814e-13 
> Size: 800, Gflops: 3.869291e+00, Diff: 8.881784e-14 
> Size: 840, Gflops: 3.936271e+00, Diff: 1.065814e-13 
> Size: 880, Gflops: 3.931259e+00, Diff: 1.030287e-13 
> Size: 920, Gflops: 3.912907e+00, Diff: 1.207923e-13 
> Size: 960, Gflops: 3.938391e+00, Diff: 1.278977e-13 
> Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13

(In reply to Dominique d'Humieres from comment #10)
> > I think you are seeing the effects of inefficiencies of assumed-shape 
> > arrays.
> >
> > If you want to use matmul on very small matrix sizes, it is best to
> > use fixed-size explicit arrays.
> 
> Then IMO the matmul inlining should be restricted to fixed-size explicit
> arrays. Could this be done before the release of gcc-6?

I was experimenting some more here a few days ago.  I really think that
inlineing shold be disabled above some threshold.  On larger arrays, the
runtime library outperforms inline and right now by default the runtime
routines are never used unless you provide -fno-frontend-optimize which is
counter intuitive for the larger arrays.

If one compiles with -march=native -mavx -Ofast etc etc, the inline can do
fairly well on the larger, however when we update the runtime routines to
something like shown in comment #8 it will make even more sense to not do
inline all the time. (Unless of course we further optimize the
frontend-optimize to do better.)

Reply via email to