https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109985

            Bug ID: 109985
           Summary: __builtin_prefetch ignored by GCC 12/13
           Product: gcc
           Version: 13.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pdimov at gmail dot com
  Target Milestone: ---

We are investigating a Boost.Unordered performance regression with GCC 12,
on the following benchmark:

https://github.com/boostorg/boost_unordered_benchmarks/blob/4c717baac1bff8d3e51cb8485b72bbb63d533265/scattered_lookup.cpp

and it looks like the reason is that GCC 12 (and 13) ignore a call to
`__builtin_prefetch`.
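
For readers unfamiliar with the builtin, here is a minimal sketch of the kind of call involved. It is not the Boost.Unordered code: `Slot`, `sum_scattered`, and the prefetch distance of 4 are hypothetical names and values chosen for illustration. The only real API used is `__builtin_prefetch(addr, rw, locality)`, where `rw = 0` requests a read prefetch and `locality = 3` (the default) is what GCC normally lowers to `prefetcht0` on x86.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical element type; not Boost.Unordered's actual layout.
struct Slot {
    std::uint64_t key;
    std::uint64_t value;
};

// Sums table[probes[i]].value over all probes, issuing a prefetch for the
// slot that will be read a few iterations later. The prefetch is only a
// hint: the result is the same whether or not the compiler emits it.
std::uint64_t sum_scattered(const std::vector<Slot>& table,
                            const std::vector<std::size_t>& probes)
{
    constexpr std::size_t distance = 4; // hypothetical prefetch distance
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < probes.size(); ++i) {
        if (i + distance < probes.size()) {
            assert(probes[i + distance] < table.size());
            // rw = 0 (read), locality = 3 (keep in all cache levels).
            __builtin_prefetch(&table[probes[i + distance]], 0, 3);
        }
        assert(probes[i] < table.size());
        sum += table[probes[i]].value;
    }
    return sum;
}
```

Because the builtin has no observable effect on program results, a dropped prefetch shows up only in the generated assembly and in timings, which is why the comparison below is made at the instruction level.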

GCC 11 generates this:

```
.L108:
        mov     r8, r12
        movdqa  xmm0, xmm1
        sal     r8, 4
        lea     r14, [r10+r8]
        pcmpeqb xmm0, XMMWORD PTR [r14]
        pmovmskb        edx, xmm0
        and     edx, 32767
        je      .L104
        sub     r8, r12
        sal     r8, 4
        add     r8, QWORD PTR [rbx+32]
        prefetcht0      [r8]
.L106:
        xor     r15d, r15d
        rep bsf r15d, edx
        movsx   r15, r15d
        sal     r15, 4
        add     r15, r8
        cmp     rsi, QWORD PTR [r15]
        jne     .L144
        add     r9, QWORD PTR [r15+8]
        mov     rax, rdi
        cmp     r11, rdi
        jne     .L145
```
(https://godbolt.org/z/d663fdM16 - prefetcht0 [r8] right before L106)

GCC 12 generates this in the same function:
```
.L108:
        mov     r8, r10
        movdqa  xmm0, xmm1
        sal     r8, 4
        lea     r9, [rbp+0+r8]
        pcmpeqb xmm0, XMMWORD PTR [r9]
        pmovmskb        edx, xmm0
        and     edx, 32767
        je      .L104
        mov     rdi, QWORD PTR [rsp+16]
        sub     r8, r10
        mov     QWORD PTR [rsp+24], rax
        sal     r8, 4
        mov     rdi, QWORD PTR [rdi+32]
        mov     QWORD PTR [rsp+8], rdi
        mov     rax, rdi
.L106:
        xor     edi, edi
        rep bsf edi, edx
        movsx   rdi, edi
        sal     rdi, 4
        add     rdi, r8
        add     rdi, rax
        cmp     r11, QWORD PTR [rdi]
        jne     .L143
        add     rsi, 8
        add     rbx, QWORD PTR [rdi+8]
        cmp     r12, rsi
        jne     .L109
```
(https://godbolt.org/z/T7csq7TPz - no prefetcht0 instruction before L106)

Unfortunately, when the code is simplified, GCC emits the prefetcht0 again,
which makes it hard to produce a reduced test case.
