michaelselehov wrote:

@addmisol -- friendly heads-up: we're tracking a regression that bisects 
cleanly to this PR.

In HIP code on `amdgcn` targets, FP16 wavefront reductions 
(`__reduce_{min,max}_sync<__half>`) return wrong values after this change. The 
root cause appears to be: HIP's `__half` (`struct { union { _Float16 } }`) is 
now classified by `ABIArgInfo::getDirect()` as an aggregate, while pre-built 
`rocm-device-libs` still exposes `__ockl_wfred_*_f16` with a flat `half` 
parameter. The resulting call-site / callee signature mismatch blocks inlining 
of a `convergent` wave-reduction helper, which then runs under a divergent-call 
EXEC mask and reads stale data from neighbour lanes via DPP.

Reproducer is a single-file HIP program at `-O2`, ~2 s end-to-end; reverting 
this PR on top of the otherwise-identical tree makes it pass. Confirmed on 
MI200/MI210 (gfx90a).

Are you aware of this, and do you have a preference on how to address it? A 
couple of options we've considered:

- narrow the new classification in `clang/lib/CodeGen/Targets/AMDGPU.cpp` to 
keep the legacy `[N x i32]` coercion for small aggregates whose only data is 
`_Float16` (and `__bf16`), so the calling convention stays compatible with 
already-shipped device-libs;
- or accept that this PR requires a coordinated `rocm-device-libs` rebuild and 
document that expectation.

Happy to share more (full IR/asm diff, bisect trail, etc.) or to test a 
candidate fix. Thanks!

https://github.com/llvm/llvm-project/pull/185083
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to