------- Comment #39 from burnus at gcc dot gnu dot org 2009-07-27 13:15 -------
(In reply to comment #38)
> However, the loop can be split: [..]
> making the first loop vectorizable (inner-most loop vectorization).
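For readers without the elided code at hand, here is a minimal sketch of what a split maxloc loop might look like. It is not the code from comment #38 and not the contents of the linked Fortran files; the function names `maxloc_fused` and `maxloc_split` are hypothetical, the index is 0-based (unlike Fortran's MAXLOC), and the sketch assumes n >= 1 and no NaN values.

```c
#include <stddef.h>

/* Fused form: value and position are tracked in the same loop.
 * The conditional update of pos creates a dependence that tends
 * to get in the way of vectorization. */
size_t maxloc_fused(const float *a, size_t n)
{
    size_t pos = 0;
    float max = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] > max) {
            max = a[i];
            pos = i;
        }
    }
    return pos;
}

/* Split form: the first loop is a plain max reduction, the second
 * is a scalar search for the first element equal to that maximum. */
size_t maxloc_split(const float *a, size_t n)
{
    float max = a[0];
    for (size_t i = 1; i < n; i++)
        max = (a[i] > max) ? a[i] : max;

    size_t pos = 0;
    while (a[pos] != max)   /* terminates: max was read from a[] */
        pos++;
    return pos;
}
```

Both functions return the first position of the maximum. Whether the reduction loop is actually vectorized depends on the compiler version and the floating-point flags in use.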
OK. I tried it with a Fortran program:

  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc.f90
  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc2.f90
  http://users.physik.fu-berlin.de/~tburnus/tmp/vect-PR31067/maxloc3.f90

  maxloc.f90 is the program from comment 34 (maxloc.s = Intel assembler)
  maxloc2.f90 models what gfortran does for maxloc (maxloc.s = Intel assembler)
  maxloc3.f90 models the variant with a split loop

Splitting plus vectorization makes the calculation about 5% faster: 0m2.152s (maxloc3) vs. 0m2.260s (maxloc2). Still, that is 35% more time than ifort needs.

For some reason, maxloc2 with -fno-tree-vectorize takes only 0m1.840s, which is identical to Intel's result for maxloc2/maxloc3. For the original maxloc.f90, by contrast, vectorization has no performance effect, and for maxloc3 it makes the code faster.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31067