Issue 106011
Summary Lack of or poor autovectorization of loops over references into contiguous memory
Labels new issue
Assignees
Reporter the8472
    Rust iterators over slices yield references to each item. Even when iterating over references to primitives and applying simple, pure functions LLVM only manages to unroll the loops but fails to vectorize them.


## Case 1 - slice.iter().fold()

https://rust.godbolt.org/z/fn3W8PGE3

The second function from https://github.com/rust-lang/rust/issues/113789

```rust
pub fn check_range_fold(keys: &[u32], max: u32) -> bool {
 keys.iter().fold(true, |a, x| a && *x < max)
}
```

results in this loop body

```assembly
.LBB0_5:
        cmp     dword ptr [rdi + 4*r8], edx
        setb    r9b
        and     r9b, al
 cmp     dword ptr [rdi + 4*r8 + 4], edx
        setb    al
        cmp dword ptr [rdi + 4*r8 + 8], edx
        setb    r10b
        and r10b, al
        and     r10b, r9b
        cmp     dword ptr [rdi + 4*r8 + 12], edx
        setb    al
        and     al, r10b
 add     r8, 4
        cmp     rsi, r8
        jne .LBB0_5
```

This is unrolled and the values are loaded unconditionally, so it's not an issue of dereferencability or profitability of loading the data.

Expected behavior: Vectorizes with `vpcmpud`

## Case 2 - manually unrolled slice.iter().all()

https://rust.godbolt.org/z/dxEsqj3fx

This is derived from an attempt to implement manual unrolling for `iterator.all()` - which is short-circuiting - to convince LLVM that vectorization would be profitable. Again the unrolling happens and the values are loaded unconditionally but they're not vectorized.

```rust
#![feature(slice_as_chunks)]

pub fn all_auto(a: &[u32], b: u32) -> bool {
    all_generic_simple(a, |&x| x == b)
}


fn all_generic_simple<'a>(a: &'a [u32], mut b: impl FnMut(&'a u32) -> bool) -> bool {
    const UNROLL: usize = 4;

 let (chunks, remainder) = a.as_chunks::<UNROLL>();

    for chunk in chunks {
        let chunk: &[u32; 4] = chunk;

        if !(b(&chunk[0]) && b(&chunk[1]) && b(&chunk[2]) && b(&chunk[3])) {
 return false;
        }
    }

 remainder.iter().all(b)
}


```assembly
.LBB0_1:
 test    rsi, rsi
        je      .LBB0_2
        cmp     dword ptr [rax], edx
        jne     .LBB0_12
        cmp     dword ptr [rax + 4], edx
        jne     .LBB0_12
        cmp     dword ptr [rax + 8], edx
        jne     .LBB0_12
        add     rsi, -16
        lea r9, [rax + 16]
        cmp     dword ptr [rax + 12], edx
        mov rax, r9
        je      .LBB0_1
```

Expected behavior: Vectorizes with `vpcmpeqd`


Curiously putting the unrolled short-circuiting comparison in a function - which still gets inlined - *and* bloading its return type with an additional dummy value causes it to vectorize.

```assembly
.LBB0_2:
        vpcmpeqd        k0, xmm0, xmmword ptr [rdi + rcx]
        kmovd   r8d, k0
        cmp r8b, 15
        jne     .LBB0_10
        add     rcx, 16
        cmp rax, rcx
        jne     .LBB0_2
```

Rust issue similar to this case: https://github.com/rust-lang/rust/issues/105259

## Case 3 - manually unrolled slice.iter().max()

https://github.com/rust-lang/rust/issues/106539

https://rust.godbolt.org/z/3W5ne4hsK

This does attempt to vectorize, but the code looks pretty bad. The inner loop still seems to contain a bunch of individual loads. I was expecting some use of `vpcmpud`  in there.
Though this case is less important I realize AVX/AVX2 don't have unsigned comparisons and I'd actually have to write this differently to get something based on `vpmaxud` instead.

```rust
#![feature(slice_as_chunks)]

pub fn iter_max<'a>(a: &'a [u32]) -> Option<&u32> {
    const UNROLL: usize = 8;

    let (chunks, remainder) = a.as_chunks::<UNROLL>();

 let mut cur_max = unsafe { chunks.get(0)?.iter().max().unwrap_unchecked() };
    let mut cur = *cur_max;

    for chunk in chunks {
 // this should be the fast path that vectorizes
        if chunk.map(|e| e >= cur) == [false ; UNROLL] {
            continue;
        }

 for i in 0..UNROLL {
            if chunk[i] >= cur {
 cur_max = &chunk[i];
            }
        }

        cur = *cur_max;
    }

    for r in remainder {
        cur_max = if r >= cur_max { r } else { cur_max };
    }

 Some(cur_max)
}
```

```assembly
.LBB0_3:
        lea r9, [rdi + 32]
        cmp     r9, r8
        je      .LBB0_4
 mov     ebp, dword ptr [rdi + 32]
        mov     r10d, dword ptr [rdi + 52]
        xor     r14d, r14d
        cmp     r10d, esi
 setae   r14b
        mov     r11d, dword ptr [rdi + 56]
        xor r12d, r12d
        cmp     r11d, esi
        setae   r12b
 mov     ebx, dword ptr [rdi + 60]
        xor     r13d, r13d
 cmp     ebx, esi
        setae   r13b
        xor     r15d, r15d
 cmp     ebp, esi
        setae   r15b
```
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

Reply via email to