michaelselehov wrote:
Reverts #185083.
This change causes a numerical-correctness regression on AMDGPU: HIP FP16
wavefront reductions (`__reduce_{min,max}_sync<__half>`) return wrong values on
gfx90a (verified on MI210). The PR author has explicitly given the green light
to revert while they investigate further.
### Root cause
HIP declares `__half` as a doubly-nested aggregate
(`hip/amd_detail/amd_hip_fp16.h`):
```
%struct.__half = type { %union.anon }
%union.anon = type { half }
```
With this PR in place, clang's AMDGPU ABI lowering passes the HIP `__half` via
`ABIArgInfo::getDirect()` as an aggregate. The kernel's call into the
device-libs wave-reduction helper comes out as:
```llvm
%fca = insertvalue %union.anon poison, half %x, 0
%c = tail call %struct.__half @__ockl_wfred_min_f16(%union.anon %fca) #9
```
The body of `__ockl_wfred_min_f16` comes from pre-built `rocm-device-libs`
bitcode, whose source uses a plain `half` parameter:
```llvm
define internal half @__ockl_wfred_min_f16(half noundef %0)
```
Call-site argument type (`%union.anon`) does not match the callee formal type
(`half`), and likewise for the return type. LLVM's `InlineFunction()`
conservatively refuses to inline calls with mismatched argument SSA types, so
this `convergent` wave-reduction helper stays out-of-line.
This matters because the kernel calls the helper from inside a divergent region
(the `__reduce_*_sync` lane-mask `if`). AMDGPU lowers a call inside divergent
control by saving EXEC and entering the callee with EXEC = (lanes that took the
branch). The `update.dpp` / `ds_bpermute` intrinsics in the helper are
`convergent` and physically read from neighbouring lanes' VGPRs -- but those
neighbours never entered the callee, so the DPP picks up stale data left in
`v0` from before the call (an unrelated input, a previous result, whatever
happened to be there). When the helper *is* inlined, the same intrinsics run
under the kernel's full-wave EXEC and the inactive lanes' inputs are explicitly
replaced with identity values (`0x7C00` / `0xFC00`) before the DPP, so the
reduction is correct. The numerical failure pattern matches exactly: `min`
collapses toward 0 (stale `0x0000` wins), `max` picks up an unrelated neighbour.
Coordinated rebuild of `rocm-device-libs` does not fix this. The device-libs
source uses plain `half`, not HIP's `__half`, so its IR signature stays
`half(half)` regardless of which clang builds it. The mismatch is structural --
between the HIP header type and the device-libs source type -- and was
previously masked because pre-#185083 the aggregate got coerced to a
scalar-shaped value that the inliner could trivially bitcast through.
### Verification
Reproduced with a single-translation-unit HIP program at `-O2`
(`__reduce_{min,max}_sync<__half>` over a small random buffer, then a host-side
bitwise compare): ~48k FP16 mismatches with original PR in place, 0 mismatches
with this revert applied.
https://github.com/llvm/llvm-project/pull/199981
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits