https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107718
            Bug ID: 107718
           Summary: clang optimizes TSVC s317 a lot better
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This is a stupid benchmark but still...

jh@alberti:~/tsvc/bin> more tt2.c
typedef double real_t;
#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t qq;
int main(void)
{
    real_t q;
    for (int nl = 0; nl < 5*iterations; nl++) {
        q = (real_t)1.;
        for (int i = 0; i < LEN_1D/2; i++) {
            q *= (real_t).99;
        }
        qq+=q;
    }
    return q;
}
jh@alberti:~/tsvc/bin> time ./a.out

real    0m0.805s
user    0m0.805s
sys     0m0.000s
jh@alberti:~/tsvc/bin> clang -Ofast -march=native tt2.c
jh@alberti:~/tsvc/bin> time ./a.out

real    0m0.010s
user    0m0.007s
sys     0m0.003s

Clang does:

.LBB0_2:                                # Parent Loop BB0_1 Depth=1
                                        # => This Inner Loop Header: Depth=2
        vmulpd  %zmm2, %zmm3, %zmm3
        vmulpd  %zmm2, %zmm4, %zmm4
        vmulpd  %zmm2, %zmm5, %zmm5
        vmulpd  %zmm2, %zmm6, %zmm6
        addl    $-3200, %ecx                    # imm = 0xF380
        jne     .LBB0_2
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        vmulpd  %zmm3, %zmm4, %zmm3

So it runs multiplications and because of unrolling combines the exponent?
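
One possible reading of the clang loop quoted above, offered as a sketch and not as a confirmed analysis of what clang does internally: the loop keeps four zmm accumulators (32 double lanes in total) and decrements the counter by 3200 per iteration, so each lane would have to advance by 3200 / 32 = 100 multiplications per step. That suggests %zmm2 holds .99 raised to the 100th power. The C below models that idea in scalar code. The names product_scalar, product_reassociated, rel_diff, LANES and STEP are hypothetical, and the value of the per-step factor is an assumption inferred from the counter step.

```c
#define LEN_1D 32000
#define N      (LEN_1D / 2)  /* inner trip count from the test case: 16000 */
#define LANES  32            /* assumption: 4 zmm accumulators x 8 doubles */
#define STEP   3200          /* counter decrement seen in the clang loop   */

/* The inner loop exactly as written in the test case. */
static double product_scalar(void)
{
    double q = 1.0;
    for (int i = 0; i < N; i++)
        q *= 0.99;
    return q;
}

/* Hypothetical reassociated form: LANES independent accumulators, each
 * multiplied by a pre-powered constant, then reduced at the end. */
static double product_reassociated(void)
{
    /* Assumed loop-invariant factor: 0.99^(STEP / LANES) = 0.99^100. */
    double factor = 1.0;
    for (int k = 0; k < STEP / LANES; k++)
        factor *= 0.99;

    double acc[LANES];
    for (int l = 0; l < LANES; l++)
        acc[l] = 1.0;

    /* N / STEP = 5 steps; each step stands for STEP scalar multiplies. */
    for (int i = N; i != 0; i -= STEP)
        for (int l = 0; l < LANES; l++)
            acc[l] *= factor;

    /* Final reduction across lanes (the multiplies after the loop). */
    double q = 1.0;
    for (int l = 0; l < LANES; l++)
        q *= acc[l];
    return q;
}

/* Relative difference between two values; b must be nonzero. */
static double rel_diff(double a, double b)
{
    double d = (a - b) / b;
    return d < 0 ? -d : d;
}
```

Both functions perform the equivalent of 16000 multiplications by 0.99, but the second changes the order of the floating-point operations, so the results agree only up to rounding. A compiler may make this change only when reassociation is permitted, which -Ofast allows because it enables -ffast-math.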