[Bug middle-end/107718] New: clang optimizes TSVC s317 a lot better

hubicka at gcc dot gnu.org via Gcc-bugs Wed, 16 Nov 2022 09:11:43 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107718


            Bug ID: 107718
           Summary: clang optimizes TSVC s317 a lot better
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

This is a stupid benchmark but still...

jh@alberti:~/tsvc/bin> more tt2.c

typedef double real_t;
#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D];
real_t qq;
int
main(void)
{

    real_t q;
    for (int nl = 0; nl < 5*iterations; nl++) {
        q = (real_t)1.;
        for (int i = 0; i < LEN_1D/2; i++) {
            q *= (real_t).99;
        }
        qq+=q;
    }

    return q;
}
jh@alberti:~/tsvc/bin> time ./a.out

real    0m0.805s
user    0m0.805s
sys     0m0.000s
jh@alberti:~/tsvc/bin> clang -Ofast -march=native tt2.c  
jh@alberti:~/tsvc/bin> time ./a.out

real    0m0.010s
user    0m0.007s
sys     0m0.003s

Clang does:
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmulpd  %zmm2, %zmm3, %zmm3
        vmulpd  %zmm2, %zmm4, %zmm4
        vmulpd  %zmm2, %zmm5, %zmm5
        vmulpd  %zmm2, %zmm6, %zmm6
        addl    $-3200, %ecx                    # imm = 0xF380
        jne     .LBB0_2
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        vmulpd  %zmm3, %zmm4, %zmm3


So it runs the multiplications in several independent vector chains and, because of unrolling, combines the exponents?
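
For illustration only, here is a scalar sketch of the kind of transformation the assembly above seems to suggest. It is not taken from clang's output. It assumes that -Ofast (which enables fast-math) allows the product to be reassociated, and the function names product_serial, product_split, product_folded and rel_diff are made up for this example. The "folded" variant is only a guess at what combining the exponent could mean: several multiplications by the same constant replaced by one multiplication by a precomputed power.

```c
#define N 16000 /* LEN_1D/2 from the test case above */

/* Original form: one dependent multiply per iteration. */
static double product_serial(double c)
{
    double q = 1.0;
    for (int i = 0; i < N; i++)
        q *= c;
    return q;
}

/* Hypothetical reassociated form: four independent chains that a
   vectorizer could keep in separate registers, combined after the loop.
   Assumes N is a multiple of 4. */
static double product_split(double c)
{
    double q0 = 1.0, q1 = 1.0, q2 = 1.0, q3 = 1.0;
    for (int i = 0; i < N; i += 4) {
        q0 *= c;
        q1 *= c;
        q2 *= c;
        q3 *= c;
    }
    return (q0 * q1) * (q2 * q3);
}

/* Hypothetical further step: fold four multiplications by c into a
   single multiplication by c^4, computed once outside the loop. */
static double product_folded(double c)
{
    double c4 = (c * c) * (c * c);
    double q = 1.0;
    for (int i = 0; i < N; i += 4)
        q *= c4;
    return q;
}

/* Relative difference between two values; b must be nonzero. */
static double rel_diff(double a, double b)
{
    double d = (a - b) / b;
    return d < 0.0 ? -d : d;
}
```

All three variants compute .99^16000 up to rounding, but the rounding differs between them, which is why a compiler may only rewrite one into another when reassociation is permitted. The split and folded forms also shorten the dependency chain, so the multiplications no longer have to wait on each other one by one.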
