michaelselehov wrote:
> Device libs should _only_ be used with a build from the freshly built
> compiler. This is user error
@arsenm I wish it were that easy. Unfortunately, even fresh AOMP and TheRock
builds (where device-libs and clang come out of the same CI invocation) keep
the old `half(half)` signature in device-libs, and the test still fails. We hit
this in the nightly TheRock build originally; that's how we got here.
The reason a rebuild doesn't change anything: this PR touches the
classification of *aggregates* on the caller side, but `__ockl_wfred_*_f16` in
`rocm-device-libs` is sourced as a flat `half` parameter -- not an aggregate.
Rebuilding `ockl.bc` with a clang that has this PR applied still emits `define
half @__ockl_wfred_min_f16(half ...)`. I diff-compared the function signature
in `ockl.bc` coming from /opt/rocm, /COD/2026-04-19, and TheRock #1390 --
byte-identical for this function.
The mismatch is structural, between the HIP header type and the device-libs
source type:
* HIP declares `__half = struct { union { _Float16 } }` (aggregate).
* device-libs' `__ockl_wfred_*_f16` takes plain `half` (scalar).
Pre-#185083 the aggregate was coerced to a small integer / scalar shape that
the inliner could happily bitcast through, so the wrapper was inlined into the
kernel and its DPP / ds_bpermute ran under the caller's full-wave EXEC.
Post-#185083 the aggregate is kept as `{half}` all the way to the call site;
the inliner won't rewrite `{half}` -> `half` SSA, the wrapper stays
out-of-line, and the now-divergent call serializes EXEC. The DPP intrinsics
then read from neighbour lanes that never entered the callee and return
whatever stale data sits in those VGPRs -- which is exactly the failure pattern
we see (`min` collapses to 0, `max` picks up garbage neighbours).
https://github.com/llvm/llvm-project/pull/185083
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits