--- Comment #2 from dorit at gcc dot gnu dot org 2007-08-30 10:12 ---
> There are two time consuming routines in air.f90 of the Polyhedron
> benchmark that are not vectorized: lines 1328 and 1354. These appear
> in the top counting of execution time with oprofile:
> SUBROUTINE DERIVY(D,U,Uy,Al,Np,Nd,M)
>   IMPLICIT REAL*8(A-H,O-Z)
>   PARAMETER (NX=150,NY=150)
>   DIMENSION D(NY,33) , U(NX,NY) , Uy(NX,NY) , Al(30) , Np(30)
>   DO jm = 1 , M
>     jmax = 0
>     jmin = 1
>     DO i = 1 , Nd
>       jmax = jmax + Np(i) + 1
>       DO j = jmin , jmax
>         uyt = 0.
>         DO k = 0 , Np(i)
>           uyt = uyt + D(j,k+1)*U(jm,jmin+k)
>         ENDDO
>         Uy(jm,j) = uyt*Al(i)
>       ENDDO
>       jmin = jmin + Np(i) + 1
>     ENDDO
>   ENDDO
>   CONTINUE
> END
> ./poly_air_1354.f90:12: note: def_stmt: uyt_1 = PHI 0.0(9), uyt_42(11)
> ./poly_air_1354.f90:12: note: Unsupported pattern.
> ./poly_air_1354.f90:12: note: not vectorized: unsupported use in stmt.
> ./poly_air_1354.f90:12: note: unexpected pattern.
> ./poly_air_1354.f90:1: note: vectorized 0 loops in function.
> This is due to an unsupported type, real_type, for the reduction variable uyt:
> (this is on an i686-linux machine)
There is no unhandled real_type problem; you just need to use -ffast-math (or
the new reassociation flag) to allow vectorization of floating-point summations:
pr33243b.f90:12: note: Analyze phi: uyt_1 = PHI 0.0(9), uyt_42(11)
pr33243b.f90:12: note: reduction: unsafe fp math optimization: D.1386_41 + uyt_1
pr33243b.f90:12: note: Unknown def-use cycle pattern.
If you use -ffast-math the reduction is detected:
pr33243b.f90:12: note: Analyze phi: uyt_1 = PHI 0.0(9), uyt_42(11)
pr33243b.f90:12: note: detected reduction:D.1386_41 + uyt_1
pr33243b.f90:12: note: Detected reduction.
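To illustrate why the vectorizer refuses to touch an fp summation without
-ffast-math, here is a minimal sketch (the program name, the lane variables and
the data are made up for illustration, they are not from air.f90). It adds the
same four values once in source order and once in the order a two-lane
vectorized reduction would use. In strict IEEE double arithmetic the two sums
are expected to differ, because the small terms are absorbed at different
points; with x87 extended precision the difference may not show up.

```fortran
PROGRAM REASSOC_DEMO
  IMPLICIT NONE
  DOUBLE PRECISION :: x(4), s_seq, lane0, lane1
  INTEGER :: k

  ! Illustrative data: two large terms that cancel, two small terms.
  x = (/ 1.0D16, 1.0D0, -1.0D16, 1.0D0 /)

  ! Source order: ((x1 + x2) + x3) + x4
  s_seq = 0.0D0
  DO k = 1, 4
    s_seq = s_seq + x(k)
  ENDDO

  ! Order used by a 2-lane vector reduction: (x1 + x3) + (x2 + x4).
  ! Each lane keeps a partial sum; the lanes are combined after the loop.
  lane0 = 0.0D0
  lane1 = 0.0D0
  DO k = 1, 3, 2
    lane0 = lane0 + x(k)
    lane1 = lane1 + x(k+1)
  ENDDO

  PRINT *, 'source-order sum :', s_seq
  PRINT *, 'two-lane sum     :', lane0 + lane1
END PROGRAM REASSOC_DEMO
```

Since the reordering can change the result, it is only done when the user
explicitly allows it.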
However, the loop will still not get vectorized because there is a
non-consecutive access in the loop:
pr33243b.f90:12: note: === vect_analyze_data_ref_accesses ===
pr33243b.f90:12: note: not consecutive access
pr33243b.f90:12: note: not vectorized: complicated access pattern.
This is because consecutive iterations of the inner loop (k-loop) access
D(j,k+1) and U(jm,jmin+k) with a stride of 1200 bytes (the arrays are
column-major with a leading dimension of 150 REAL*8 elements of 8 bytes each):
DO j = jmin , jmax
  uyt = 0.
  DO k = 0 , NP(i)
    uyt = uyt + D(j,k+1)*U(jm,jmin+k)
  ENDDO
  Uy(jm,j) = uyt*Al(i)
ENDDO
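The stride figure can be checked with a few lines of Fortran. This is only a
sketch: the program name is made up, the array shapes are copied from the
DIMENSION statement of the testcase, and STORAGE_SIZE is a Fortran 2008
intrinsic, so it assumes a reasonably recent compiler.

```fortran
PROGRAM STRIDE_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 150, NY = 150
  DOUBLE PRECISION :: D(NY,33), U(NX,NY)
  INTEGER :: elem_bytes

  ! STORAGE_SIZE returns the size of one element in bits.
  elem_bytes = STORAGE_SIZE(D) / 8

  ! Fortran arrays are column-major: advancing the second subscript by one
  ! skips a whole column, i.e. SIZE(array, 1) elements.
  PRINT *, 'bytes per element             :', elem_bytes
  PRINT *, 'stride of D(j,k+1) in k       :', SIZE(D, 1) * elem_bytes
  PRINT *, 'stride of U(jm,jmin+k) in k   :', SIZE(U, 1) * elem_bytes
END PROGRAM STRIDE_DEMO
```

With 8-byte reals both strides come out as 150 * 8 = 1200 bytes, which is what
the vectorizer sees in the k-loop.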
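In the outer-loop (j-loop) the access to D is consecutive and the access to U
is invariant, and you also don't need the -ffast-math flag, since each j
iteration keeps its own sum and no reassociation takes place. However, there
are other problems: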
1) the compiler creates a guard to control whether to enter the inner-loop or
not (because it may execute 0 times). This creates a more involved control-flow
than the outer-loop vectorizer is willing to work with. A solution would be to
create this guard outside the outer-loop (in case it is invariant, as is the
case here), which is like versioning the loop (or unswitching the loop).
2) if you change the loop count to a constant (just to bypass the above
problem), then indeed no guard code is generated, but there is a computation
(advancing an induction variable) in the latch block of the outer-loop, so the
latch is not empty, and we are not willing to work with such loops. We need to
clean that away.
3) After these problems are solved, we still need to deal with a
non-consecutive access in the outer-loop - the store to Uy(jm,j). AFAICS, this
requires either transposing the Uy array in advance, or teaching the vectorizer
to scatter the results to the non-adjacent locations (which would be quite
expensive, but we could give it a try).
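For problem 1, the effect of creating the guard outside the outer-loop can be
sketched at the source level. This is only an illustration of the idea, not
what the compiler emits: ROW_BLOCK, the demo program and its data are
hypothetical, and the test on np_i is written by hand where the compiler would
place the hoisted guard.

```fortran
PROGRAM UNSWITCH_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8, NPI = 3
  DOUBLE PRECISION :: D(N,NPI+1), U(N,N), Uy(N,N)
  INTEGER :: j, k

  ! Illustrative data only.
  DO k = 1, NPI+1
    DO j = 1, N
      D(j,k) = DBLE(j + k)
    ENDDO
  ENDDO
  DO k = 1, N
    DO j = 1, N
      U(j,k) = DBLE(j) + 0.5D0*DBLE(k)
    ENDDO
  ENDDO
  Uy = 0.0D0

  CALL ROW_BLOCK(D, U, Uy, 2.0D0, NPI, 1, 1, NPI+1)
  PRINT *, Uy(1,1:NPI+1)

CONTAINS

  SUBROUTINE ROW_BLOCK(D, U, Uy, al_i, np_i, jm, jmin, jmax)
    DOUBLE PRECISION, INTENT(IN)    :: D(:,:), U(:,:), al_i
    DOUBLE PRECISION, INTENT(INOUT) :: Uy(:,:)
    INTEGER, INTENT(IN) :: np_i, jm, jmin, jmax
    DOUBLE PRECISION :: uyt
    INTEGER :: j, k

    ! The trip count of the k-loop does not depend on j, so the
    ! "does the inner loop run at all?" test is done once, outside the j-loop.
    IF (np_i >= 0) THEN
      DO j = jmin, jmax
        uyt = 0.0D0
        DO k = 0, np_i
          uyt = uyt + D(j,k+1)*U(jm,jmin+k)
        ENDDO
        Uy(jm,j) = uyt*al_i
      ENDDO
    ELSE
      ! Inner loop would execute 0 times: the sum stays at its initial value.
      DO j = jmin, jmax
        Uy(jm,j) = 0.0D0*al_i
      ENDDO
    ENDIF
  END SUBROUTINE ROW_BLOCK

END PROGRAM UNSWITCH_DEMO
```

Each version of the j-loop then has simple control-flow, which is the shape the
outer-loop vectorizer wants to see. Problems 2 and 3 are not addressed by this.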
Alternatively, vectorizing the inner-loop would require transposing the D and U
matrices.
Another option is to interchange the jm loop with the j loop - I think this way
all accesses would be consecutive, and we could vectorize the jm loop (which
would now be a doubly-nested loop that the outer-loop vectorizer could handle).
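The interchange can be sketched as follows. This is a hand-written
illustration under stated assumptions, not a tested fix: the program name, the
values of ND, M, Np and Al, and the contents of D and U are made up, and only
the array shapes and the loop body come from the testcase. The jmin/jmax
bookkeeping depends only on i, so the jm loop can be moved inside the j loop;
the program runs both orders and prints the largest difference, which should
be zero or at rounding level since each sum is still accumulated in the same
order.

```fortran
PROGRAM INTERCHANGE_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: NX = 150, NY = 150, ND = 2, M = 4
  DOUBLE PRECISION :: D(NY,33), U(NX,NY), Al(30)
  DOUBLE PRECISION :: Uy_ref(NX,NY), Uy_new(NX,NY), uyt
  INTEGER :: Np(30), i, j, k, jm, jmin, jmax

  ! Illustrative data only.
  Np = 0
  Np(1) = 3
  Np(2) = 5
  Al = 1.5D0
  DO k = 1, 33
    DO j = 1, NY
      D(j,k) = 1.0D0 / DBLE(j + k)
    ENDDO
  ENDDO
  DO j = 1, NY
    DO i = 1, NX
      U(i,j) = DBLE(i) + 0.5D0*DBLE(j)
    ENDDO
  ENDDO
  Uy_ref = 0.0D0
  Uy_new = 0.0D0

  ! Original order: jm outermost, Uy(jm,j) is stored with a stride in j.
  DO jm = 1, M
    jmax = 0
    jmin = 1
    DO i = 1, ND
      jmax = jmax + Np(i) + 1
      DO j = jmin, jmax
        uyt = 0.0D0
        DO k = 0, Np(i)
          uyt = uyt + D(j,k+1)*U(jm,jmin+k)
        ENDDO
        Uy_ref(jm,j) = uyt*Al(i)
      ENDDO
      jmin = jmin + Np(i) + 1
    ENDDO
  ENDDO

  ! Interchanged order: the jm loop now encloses only the k-loop.
  ! In jm, D(j,k+1) is invariant and U(jm,...) and Uy(jm,j) are consecutive.
  jmax = 0
  jmin = 1
  DO i = 1, ND
    jmax = jmax + Np(i) + 1
    DO j = jmin, jmax
      DO jm = 1, M
        uyt = 0.0D0
        DO k = 0, Np(i)
          uyt = uyt + D(j,k+1)*U(jm,jmin+k)
        ENDDO
        Uy_new(jm,j) = uyt*Al(i)
      ENDDO
    ENDDO
    jmin = jmin + Np(i) + 1
  ENDDO

  PRINT *, 'max abs difference:', MAXVAL(ABS(Uy_new - Uy_ref))
END PROGRAM INTERCHANGE_DEMO
```

Whether the vectorizer actually handles the interchanged jm loop still depends
on the guard and latch problems described above.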
So, the PR for this testcase would be better classified under one of the above
problems/missed-optimizations rather than unhandled real_type.
> Another similar routine that also appears in the top ranked and not
> vectorized due to the same unsupported real_type reasons is in air.f90:1181
> SUBROUTINE FVSPLTX2
>   IMPLICIT REAL*8(A-H,O-Z)
>   PARAMETER (NX=150,NY=150)
>   DIMENSION DX(NX,33) , ALX(30) , NPX(30)
>   DIMENSION FP1(NX,NY) , FM1(NX,NY) , FP1x(30,NX) , FM1x(30,NX)
>   DIMENSION FP2(NX,NY) , FM2(NX,NY) , FP2x(30,NX) , FM2x(30,NX)
>   DIMENSION FP3(NX,NY) , FM3(NX,NY) , FP3x(30,NX) , FM3x(30,NX)
>   DIMENSION FP4(NX,NY) , FM4(NX,NY) , FP4x(30,NX) , FM4x(30,NX)
>   DIMENSION FV2(NX,NY) , DXP2(30,NX) , DXM2(30,NX)
>   DIMENSION FV3(NX,NY) , DXP3(30,NX) , DXM3(30,NX)
>   DIMENSION FV4(NX,NY) , DXP4(30,NX) , DXM4(30,NX)
>   COMMON /XD1 / FP1 , FM1 , FP2 , FM2 , FP3 , FM3 , FP4 , FM4 ,