| Issue | 175900 |
| --- | --- |
| Summary | First-fault/fault-only-first load vectorization |
| Labels | backend:RISC-V, vectorizers |
| Assignees | lukel97, arcbbb |
| Reporter | lukel97 |
We can vectorize the following uncountable early-exit loop today because we know all of `x` is dereferenceable:
```c
extern void f(char *);
int find(char y) {
  char x[1024];
  f(x);
  for (int i = 0; i < 1024; i++)
    if (x[i] == y)
      return i;
  return -1;
}
```
https://godbolt.org/z/5Ea7a8TbW
However, we cannot vectorize uncountable early-exit loops when we can't guarantee that loads of `x` won't cross a page boundary and trap[^gcc]:
```c
int find(char *x, char y, int n) {
  for (int i = 0; i < n; i++)
    if (x[i] == y)
      return i;
  return -1;
}
```
https://godbolt.org/z/ncjaje9a3
[^gcc]: GCC can vectorize this loop because it appears to version the loop on the precondition that `x` doesn't cross a page boundary.
To handle these cases, SVE and RVV provide "first-faulting" or "fault-only-first" instructions, `ldff1*` and `vle*ff.v` respectively.
@arcbbb is adding RVV support for this in #151300 and has already landed much of the prerequisite legality and costing work in #152422, #160470, #168029.
This issue aims to track the design and implementation around supporting these loops in the loop vectorizer.
Below I've raised a few points that I would like to discuss around the long-term design:
## SVE support
We have a [`vp.load.ff` intrinsic](https://llvm.org/docs/LangRef.html#llvm-vp-load-ff-intrinsic) that only RISC-V is able to lower.
AArch64 can't use this intrinsic because it doesn't have the notion of an explicit vector length, but instead works with predicate/mask registers.
I think we need a new `llvm.masked.load.ff` intrinsic that mirrors the vp intrinsic, minus the EVL.
```llvm
declare {<vscale x 4 x i32>, i32} @llvm.vp.load.ff(ptr %ptr, <vscale x 4 x i1> %mask, i32 %evl)
declare {<vscale x 4 x i32>, <vscale x 4 x i1>} @llvm.masked.load.ff(ptr %ptr, <vscale x 4 x i1> %mask)
```
I don't think we necessarily need to support a `llvm.masked.load.ff` intrinsic before #151300, and I'm not sure if the AArch64 folks already have something in mind.
However if we do end up with a masked version of the intrinsic, I think we should first generate a `llvm.masked.load.ff` and then progressively lower it to `llvm.vp.load.ff` on RISC-V, similarly to how we progressively lower masked tail folding to EVL tail folding.
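To make that progressive lowering concrete, here is a minimal sketch of how `llvm.masked.load.ff` could be expanded to `llvm.vp.load.ff` on RISC-V. The wrapper function, the mangling suffixes, and the use of `llvm.get.active.lane.mask` to convert the returned EVL back into a mask are all assumptions for illustration, not a settled design:
```llvm
; Hypothetical expansion; the nxv4i32 element type is an arbitrary choice.
define {<vscale x 4 x i32>, <vscale x 4 x i1>} @lower_masked_ff(ptr %p, <vscale x 4 x i1> %m) {
  ; Request the maximum number of lanes (VLMAX = vscale x 4).
  %vscale = call i32 @llvm.vscale.i32()
  %vlmax = mul i32 %vscale, 4
  %ld = call {<vscale x 4 x i32>, i32} @llvm.vp.load.ff.nxv4i32.p0(ptr %p, <vscale x 4 x i1> %m, i32 %vlmax)
  %vec = extractvalue {<vscale x 4 x i32>, i32} %ld, 0
  %evl = extractvalue {<vscale x 4 x i32>, i32} %ld, 1
  ; Lanes [0, evl) were read without faulting; turn that into the mask the
  ; masked form would have returned.
  %read = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 0, i32 %evl)
  %ffmask = and <vscale x 4 x i1> %read, %m
  %r0 = insertvalue {<vscale x 4 x i32>, <vscale x 4 x i1>} poison, <vscale x 4 x i32> %vec, 0
  %r1 = insertvalue {<vscale x 4 x i32>, <vscale x 4 x i1>} %r0, <vscale x 4 x i1> %ffmask, 1
  ret {<vscale x 4 x i32>, <vscale x 4 x i1>} %r1
}
```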
## Variable stepping
One of the main considerations with EVL tail folding is that it requires changing the loop region so that it no longer steps by a constant amount, but by an amount determined by `@llvm.get.active.vector.length`.
Fault-only-first vectorization has the same constraint on both SVE and RVV, regardless of whether or not it's tail folded, because both `ldff1*` and `vle*ff.v` may reduce the number of lanes read for any reason[^1].
[^1]: In the RISC-V ISA: "Even when an exception is not raised, implementations are permitted to process fewer than vl elements and reduce vl accordingly, but if vstart=0 and vl>0, then at least one element must be processed." For SVE, the operation pseudocode allows a fault to be raised by the implementation for any element: https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/LDFF1B--scalar-plus-scalar---Contiguous-load-first-fault-unsigned-bytes-to-vector--scalar-index--?lang=en
In #151300 we have to do part of the conversion to variable stepping that `VPlanTransforms::addExplicitVectorLength` also does.
Can we somehow share this part of the transform? Moreover, because the step is no longer guaranteed to be a multiple of VF, do we also need to fix up induction variable recipes and other users of VF, as in `fixupVFUsersForEVL`?
Also related is the idea that we should go through most VPlan transforms and distinguish users of the canonical IV between those that need to know when to exit and those that need the "cumulative elements processed" count; see https://github.com/llvm/llvm-project/pull/166164/changes#r2584271079
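To make the variable-stepping constraint concrete, here is a rough sketch of the scalar IV update around a fault-only-first load. This is illustrative IR written for this issue (`%vlmax` is assumed to be computed in the preheader), not what #151300 emits:
```llvm
vector.body:
  %iv = phi i64 [ 0, %vector.ph ], [ %iv.next, %vector.body ]
  %gep = getelementptr i8, ptr %x, i64 %iv
  %ld = call {<vscale x 16 x i8>, i32} @llvm.vp.load.ff.nxv16i8.p0(ptr %gep, <vscale x 16 x i1> splat (i1 true), i32 %vlmax)
  %evl = extractvalue {<vscale x 16 x i8>, i32} %ld, 1
  %evl.zext = zext i32 %evl to i64
  ; Step by the number of lanes actually read, which may be less than VF for
  ; any reason; any recipe that assumes a step of VF must be fixed up.
  %iv.next = add nuw i64 %iv, %evl.zext
```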
## Tail folding
Currently tail folding is not supported for early-exit loops. Because of this, the code generated on RISC-V without tail folding [needs to perform CSR reads of `vl` and compute masks, which is expensive](https://github.com/llvm/llvm-project/pull/151300#pullrequestreview-3571036875). These masks aren't needed with EVL tail folding, so eventually we should prefer tail folding where possible. #172454 and #174974 aim to enable tail folding for early-exit loops.
I'm not yet certain whether we should only allow fault-only-first loads with tail folding, or whether we can handle them independently provided we untangle variable stepping as mentioned above.
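As a sketch of where that expense comes from (my own illustration, not code from #151300): without tail folding, the EVL returned by the load must be rematerialized as a mask for every recipe that has to ignore the unread lanes, while with EVL tail folding it feeds the VP intrinsics directly:
```llvm
; Without tail folding: extra mask materialization on every iteration.
%read = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 0, i32 %evl)
%eq = icmp eq <vscale x 16 x i8> %vec, %y.splat
%eq.live = and <vscale x 16 x i1> %eq, %read

; With EVL tail folding: the compare consumes %evl directly.
%eq.evl = call <vscale x 16 x i1> @llvm.vp.icmp.nxv16i8(<vscale x 16 x i8> %vec, <vscale x 16 x i8> %y.splat, metadata !"eq", <vscale x 16 x i1> splat (i1 true), i32 %evl)
```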
## Intrinsics required for `brka`/`brkb` and `vmsif.m`/`vmsbf.m`
Both SVE and RVV have instructions that compute a prefix mask based on the first active element of another mask. These are useful for early-exit loops because they allow "exited" lanes past the early exit to be masked off. For example, for a vectorized `strcpy`, SVE would use `brka` and RVV would use `vmsif.m` to copy only the bytes up to and including the null terminator:
```asm
# rvv
# char *strcpy(char *dst, const char *src)
strcpy:
        mv       a2, a0          # copy dst; a0 is the return value
        li       t0, -1          # request VLMAX elements
loop:
        vsetvli  x0, t0, e8, m8, ta, ma
        vle8ff.v v8, (a1)        # fault-only-first load; may shrink vl
        csrr     t1, vl          # elements actually read
        vmseq.vi v1, v8, 0       # find the null terminator
        vfirst.m a3, v1          # index of first null, or -1 if none
        add      a1, a1, t1
        vmsif.m  v0, v1          # mask up to and including the first null
        vse8.v   v8, (a2), v0.t  # store only the live bytes
        add      a2, a2, t1
        bltz     a3, loop        # continue while no null was found
        ret

# sve
strcpy:
        mov     x2, 0
        ptrue   p2.b
loop:
        setffr                   # reset the first-fault register
        ldff1b  z0.b, p2/z, [x1, x2]
        rdffr   p0.b, p2/z       # p0 = lanes that loaded without faulting
        cmpeq   p1.b, p0/z, z0.b, #0
        brka    p0.b, p0/z, p1.b # mask up to and including the first null
        st1b    z0.b, p0, [x0, x2]
        incp    x2, p0.b         # advance by the number of bytes stored
        b.none  loop             # continue while cmpeq found no null
        ret
```
We could represent these instructions as `icmp ult/ule step-vector, (cttz-elts x)`, but given that these are single instructions on both SVE and RVV, we should probably introduce dedicated target-agnostic intrinsics so we can cost them accurately in VPlan.
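Spelled out in IR, that generic expansion would look something like the following; the `nxv16i32` step-vector type is an arbitrary choice for the example:
```llvm
; Index of the first set bit in the null mask (i1 false: result is
; well-defined even when the mask is all-zero).
%first = call i32 @llvm.experimental.cttz.elts.i32.nxv16i1(<vscale x 16 x i1> %null.mask, i1 false)
%first.ins = insertelement <vscale x 16 x i32> poison, i32 %first, i64 0
%first.splat = shufflevector <vscale x 16 x i32> %first.ins, <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
%step = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
; vmsif.m / brka: lanes up to and including the first set bit.
%sif = icmp ule <vscale x 16 x i32> %step, %first.splat
; vmsbf.m / brkb: lanes strictly before the first set bit.
%sbf = icmp ult <vscale x 16 x i32> %step, %first.splat
```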