------- Comment #39 from burnus at gcc dot gnu dot org  2009-07-27 13:15 -------
(In reply to comment #38)
> However, the loop can be split: [..]
> making the first loop vectorizable (inner-most loop vectorization).

OK. I tried it with a Fortran program:
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc.f90
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90
http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc3.f90

maxloc.f90 is the program from comment 34 (maxloc.s = intel assembler)
maxloc2.f90 models what gfortran does for maxloc (maxloc.s = intel assembler)
maxloc3.f90 models the variant with the split loop

The splitting plus vectorization makes the calculation about 5% faster:
0m2.152s (maxloc3) vs. 0m2.260s (maxloc2). Still, that's 35% more than ifort
needs.
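
For readers without access to the files, here is a rough sketch of the loop
split in C. It is not the content of maxloc2.f90 or maxloc3.f90; the function
names maxloc_fused and maxloc_split are made up for illustration, indices are
0-based (Fortran's MAXLOC is 1-based), and the code assumes n >= 1 and no NaNs
in the input. Whether GCC actually vectorizes the first loop of the split form
depends on the compiler version and options; a floating-point max reduction
may need relaxed floating-point flags.

```c
#include <stddef.h>

/* Fused form: the maximum and its position are updated in the same loop,
   so the index update depends on the comparison result. */
size_t maxloc_fused(const float *a, size_t n)
{
    size_t pos = 0;
    float best = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
            pos = i;
        }
    }
    return pos;
}

/* Split form: loop 1 is a plain max reduction (the vectorization
   candidate); loop 2 searches for the first element equal to that
   maximum and stops there. */
size_t maxloc_split(const float *a, size_t n)
{
    float best = a[0];
    for (size_t i = 1; i < n; i++)
        best = a[i] > best ? a[i] : best;

    size_t pos = 0;
    while (pos < n && a[pos] != best)
        pos++;
    return pos;
}
```

Both forms return the first position of the maximum, so they should agree on
any NaN-free input; the split form trades a second pass over part of the array
for a simpler first loop.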

For some reason, maxloc2 with -fno-tree-vectorize takes only 0m1.840s, which
is identical to Intel's result for maxloc2/maxloc3. For the original
maxloc.f90, by contrast, vectorization has no performance effect, and for
maxloc3 vectorization makes it faster.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067
