Issue 91370
Summary [X86] Worse runtime performance on Zen 4 CPU when optimizing for `znver4` or `skylake`
Labels new issue
Assignees
Reporter Systemcluster
The following code runs around 300% slower on Zen 4 when optimized for `znver4` or `skylake` than when optimized for `znver3` or other targets.

```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}
```

<details>
<summary>Full code</summary>

```rust
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}

fn main() {
    let nums = std::hint::black_box(generate());
    let now = std::time::Instant::now();
    let sum = sum(&nums);
    println!("{:?} / {}", now.elapsed(), sum);
}

fn generate() -> Vec<i64> {
    let mut v = Vec::new();
    for i in 0..1000000000 {
        v.push(i);
    }
    v
}
```

</details>

Running on a Ryzen 7950X:

```cmd
> rustc.exe -Ctarget-cpu=x86-64-v4 -Copt-level=3 .\src\main.rs && ./main.exe
138.7342ms / 499999999500000000

> rustc.exe -Ctarget-cpu=x86-64-v3 -Copt-level=3 .\src\main.rs && ./main.exe
136.2689ms / 499999999500000000

> rustc.exe -Ctarget-cpu=x86-64 -Copt-level=3 .\src\main.rs && ./main.exe 
136.0648ms / 499999999500000000

> rustc.exe -Ctarget-cpu=znver4 -Copt-level=3 .\src\main.rs && ./main.exe   
543.1562ms / 499999999500000000

> rustc.exe -Ctarget-cpu=znver3 -Copt-level=3 .\src\main.rs && ./main.exe   
137.4426ms / 499999999500000000

> rustc.exe -Ctarget-cpu=skylake -Copt-level=3 .\src\main.rs && ./main.exe
588.4743ms / 499999999500000000

> rustc.exe -Ctarget-cpu=haswell -Copt-level=3 .\src\main.rs && ./main.exe
138.5313ms / 499999999500000000
```
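
The timings above come from a single run each. For anyone reproducing them, here is a minimal sketch of a harness that repeats the measurement and reports the fastest run, which should make the gap between targets easier to separate from noise. The `best_of` helper, the run count of 5, and the smaller input size are assumptions for illustration and are not part of the original test program.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Same kernel as in the report: sums all complete chunks of 8 elements.
pub fn sum(a: &[i64]) -> i64 {
    let mut sum = 0;
    a.chunks_exact(8).for_each(|x| {
        for i in x {
            sum += i;
        }
    });
    sum
}

// Hypothetical helper (not in the original report): runs `sum` `runs` times
// and returns the fastest wall-clock time together with the last result.
pub fn best_of(nums: &[i64], runs: u32) -> (Duration, i64) {
    let mut best = Duration::MAX;
    let mut result = 0;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the compiler from hoisting or eliding the call.
        result = black_box(sum(black_box(nums)));
        best = best.min(start.elapsed());
    }
    (best, result)
}

fn main() {
    // Assumption: a smaller input (about 80 MB) than the report's 1e9 elements,
    // so the harness also runs on machines with less memory.
    let n: i64 = 10_000_000;
    let nums: Vec<i64> = (0..n).collect();
    let (best, result) = best_of(&nums, 5);
    println!("{:?} / {}", best, result);
}
```

It can be built with the same flags as above, for example `-Ctarget-cpu=znver4 -Copt-level=3`, and compared across targets.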

Disassembly here: https://godbolt.org/z/fzaGhGdWW

The tested optimization targets all generate different assembly with different levels of unrolling, but the `znver4` and `skylake` targets seem to be outliers.

I don't know whether the `skylake` target has the same issue or whether its slowdown is just caused by a mismatch between the optimization target and the CPU, but both targets produce the long list of constant values and show similar runtime performance. I also didn't test any targets other than those listed above.

Split from https://github.com/llvm/llvm-project/issues/90985#issuecomment-2096057259
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
