https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/176202

Summary:
The NVIDIA handling of ballot and similar operations is complicated following
Volta. These instructions return the set of converged threads *at the moment
of the call*, and it is entirely possible for the active mask to grow during
execution. The assumption that other targets' ballots return a submask of the
lane mask is therefore wrong; the mask returned by CUDA can grow beyond the
local mask.

This leaves two options: remove the special handling to make the targets
consistent, or add it to NVPTX.

This PR does the first, mainly to give the builtin a smaller interface. But I
think the second is 'more correct', since using the ballot result directly
could elect an unexpected leader.


>From 64888070d98f39c18deddd3ab7ad47423c02f152 Mon Sep 17 00:00:00 2001
From: Joseph Huber <[email protected]>
Date: Thu, 15 Jan 2026 11:10:45 -0600
Subject: [PATCH] [Clang] Change ballot mask handling for GPU intrinsics

Summary:
The NVIDIA handling of ballot and similar operations is complicated following
Volta. These instructions return the set of converged threads *at the moment
of the call*, and it is entirely possible for the active mask to grow during
execution. The assumption that other targets' ballots return a submask of the
lane mask is therefore wrong; the mask returned by CUDA can grow beyond the
local mask.

This leaves two options: remove the special handling to make the targets
consistent, or add it to NVPTX.

This PR does the first, mainly to give the builtin a smaller interface. But I
think the second is 'more correct', since using the ballot result directly
could elect an unexpected leader.
---
 clang/lib/Headers/amdgpuintrin.h | 4 +---
 clang/lib/Headers/spirvintrin.h  | 6 ++----
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/clang/lib/Headers/amdgpuintrin.h b/clang/lib/Headers/amdgpuintrin.h
index f7fb8e2814180..fac6dd995eaca 100644
--- a/clang/lib/Headers/amdgpuintrin.h
+++ b/clang/lib/Headers/amdgpuintrin.h
@@ -115,9 +115,7 @@ __gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
 // Returns a bitmask of threads in the current lane for which \p x is true.
 _DEFAULT_FN_ATTRS static __inline__ uint64_t __gpu_ballot(uint64_t __lane_mask,
                                                           bool __x) {
-  // The lane_mask & gives the nvptx semantics when lane_mask is a subset of
-  // the active threads
-  return __lane_mask & __builtin_amdgcn_ballot_w64(__x);
+  return __builtin_amdgcn_ballot_w64(__x);
 }
 
 // Waits for all the threads in the block to converge and issues a fence.
diff --git a/clang/lib/Headers/spirvintrin.h b/clang/lib/Headers/spirvintrin.h
index 2a10a47adedde..10fb40db02daa 100644
--- a/clang/lib/Headers/spirvintrin.h
+++ b/clang/lib/Headers/spirvintrin.h
@@ -121,12 +121,10 @@ __gpu_read_first_lane_u32(uint64_t __lane_mask, uint32_t __x) {
 // implementation is incorrect if the target uses more than 64 lanes.
 _DEFAULT_FN_ATTRS static __inline__ uint64_t __gpu_ballot(uint64_t __lane_mask,
                                                           bool __x) {
-  // The lane_mask & gives the nvptx semantics when lane_mask is a subset of
-  // the active threads.
   uint32_t [[clang::ext_vector_type(4)]] __mask =
       __builtin_spirv_subgroup_ballot(__x);
-  return __lane_mask & __builtin_bit_cast(uint64_t, __builtin_shufflevector(
-                                                        __mask, __mask, 0, 1));
+  return __builtin_bit_cast(uint64_t,
+                            __builtin_shufflevector(__mask, __mask, 0, 1));
 }
 
 // Waits for all the threads in the block to converge and issues a fence.

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits