Toon Moene wrote:
> Paolo Bonzini wrote:
>
>>> Attached you'll find the (preprocessed) source of the routine that
>>> printed the Infinity's (of course, I cannot be completely certain that
>>> it actually resulted in the wrong code, but at least it might be studied
>>> to see if it helps to find the culprit).
>>
>> No, this function is sane (the peephole *is* called a lot by this
>> function, but all is in due order).  I looked at the dumps and assembly
>> for -O2, -O3 and -O3 -fno-schedule-insns (*), and all is as expected.
>
> Yeah, it was probably too much to hope for.

No, you were right, and that's great.  -ffast-math makes a difference,
because it enables more vectorization.  It goes like this:

(insn 494 493 495 44 statin.f:703 (set (reg:SF 371)
        (vec_select:SF (reg:V4SF 367)
            (parallel [
                    (const_int 0 [0x0])
                ]))) 1408 {*vec_extractv4sf_0} (expr_list:REG_DEAD (reg:V4SF 367)
        (nil)))

Registers 371 and 367 are coalesced into xmm0.  Then the vec_select is
split to just

(set (reg:SF 21 [orig: 371]) (reg:SF 21 [orig: 367]))

The two operands are indeed != when compared as rtx pointers, but they
have the same hard register number, so the peephole should not apply in
this case.

Here is a minimized testcase:

      subroutine statin(x,y,pstratr,pconvecr,zhxy,zhxhy,ztmp)
      integer :: x,y
      real pstratr(x,y),pconvecr(x,y),zhxy(x,y)
      real ztmp(4)
      do j = 1,y
        do i = 1,x-2
          zttotrainr = zttotrainr + (pstratr(i,j) + pconvecr(i,j))*zhxy(i,j)
          ztstratr = ztstratr + pstratr(i,j)
          ztconvecr = ztconvecr + pconvecr(i,j)
          ztsenf = ztsenf + zhxy(i,j)
          ztlatf = ztlatf + zhxy(i,j)
          ztcldtop = ztcldtop + zhxy(i,j)
        enddo
      enddo
      ztmp(1)=zttotrainr
      ztmp(2)=ztstratr
      ztmp(3)=ztconvecr
      ztmp(4)=ztsenf*ztlatf*ztcldtop
      end

The following patch should fix it; you're welcome to run it through
HIRLAM.  I'm bootstrapping it in the meantime.

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 144464)
+++ gcc/config/i386/i386.md	(working copy)
@@ -20795,7 +20795,7 @@
 		     [(match_dup 0)
 		      (match_operand:SI 2 "memory_operand" "")]))
 	      (clobber (reg:CC FLAGS_REG))])]
-  "operands[0] != operands[1]
+  "!rtx_equal_p (operands[0], operands[1])
    && GENERAL_REGNO_P (REGNO (operands[0]))
    && GENERAL_REGNO_P (REGNO (operands[1]))"
   [(set (match_dup 0) (match_dup 4))
@@ -20811,7 +20811,7 @@
 		   (match_operator 3 "commutative_operator"
 		     [(match_dup 0)
 		      (match_operand 2 "memory_operand" "")]))]
-  "operands[0] != operands[1]
+  "!rtx_equal_p (operands[0], operands[1])
    && ((MMX_REG_P (operands[0]) && MMX_REG_P (operands[1]))
        || (SSE_REG_P (operands[0]) && SSE_REG_P (operands[1])))"
   [(set (match_dup 0) (match_dup 2))

Paolo
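
P.S. To illustrate why a pointer comparison is the wrong guard here, below is a minimal standalone C sketch. It does not use GCC's real rtx type: `struct toy_reg`, `toy_equal_p`, `guard_by_pointer`, `guard_by_value` and `demo_guards` are hypothetical names made up for this example. They only mimic the situation above, where two distinct register objects both name hard register 21 in SF mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a register rtx: a mode plus a register number. */
enum toy_mode { TOY_SF, TOY_V4SF };

struct toy_reg {
    enum toy_mode mode;
    unsigned regno;
};

/* Old-style guard: true whenever the two operands are different objects,
   even if both describe the same hard register. */
bool guard_by_pointer(const struct toy_reg *a, const struct toy_reg *b)
{
    return a != b;
}

/* Structural equality, loosely modelled on what rtx_equal_p does for
   registers: same mode and same register number. */
bool toy_equal_p(const struct toy_reg *a, const struct toy_reg *b)
{
    return a == b || (a->mode == b->mode && a->regno == b->regno);
}

/* New-style guard: true only if the operands really differ. */
bool guard_by_value(const struct toy_reg *a, const struct toy_reg *b)
{
    return !toy_equal_p(a, b);
}

/* Mimics the split insn: destination and source are separate objects that
   were both allocated to hard register 21. */
void demo_guards(void)
{
    struct toy_reg dst = { TOY_SF, 21 };  /* orig: 371 */
    struct toy_reg src = { TOY_SF, 21 };  /* orig: 367 */

    assert(guard_by_pointer(&dst, &src));  /* wrongly lets the peephole fire */
    assert(!guard_by_value(&dst, &src));   /* correctly rejects it */

    printf("pointer guard: %d, value guard: %d\n",
           guard_by_pointer(&dst, &src), guard_by_value(&dst, &src));
}
```

Calling `demo_guards()` from a `main` of your own runs both checks. The point is only that object identity and structural equality can disagree once two pseudos end up in the same hard register.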