[libclc] [libclc] Use unchecked division for f64 math (PR #203809)

Joseph Huber via cfe-commits Sun, 14 Jun 2026 18:59:55 -0700

https://github.com/jhuber6 created 
https://github.com/llvm/llvm-project/pull/203809


Summary:
This matches what the AMD device libraries do. We can avoid extra
steps by performing only the two Newton-Raphson steps that refine the
hardware approximation of 1 / x. The exceptional cases should not appear
in these math functions, and the change is local to AMDGPU. I verified
that the results are bitwise identical to the AMD math functions, with
performance now at parity.


From e08e7ee7b9705b1a4c3435099054154310a0ceb8 Mon Sep 17 00:00:00 2001
From: Joseph Huber <[email protected]>
Date: Sun, 14 Jun 2026 20:56:00 -0500
Subject: [PATCH] [libclc] Use unchecked division for f64 math

Summary:
This matches what the AMD device libraries do. We can avoid extra
steps by performing only the two Newton-Raphson steps that refine the
hardware approximation of 1 / x. The exceptional cases should not appear
in these math functions, and the change is local to AMDGPU. I verified
that the results are bitwise identical to the AMD math functions, with
performance now at parity.
---
 libclc/clc/lib/amdgpu/math/clc_recip_fast.inc | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
index 9d635cc700442..e19ec82d7566d 100644
--- a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
+++ b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
@@ -6,10 +6,17 @@
 //
 //===----------------------------------------------------------------------===//
 
-// On AMDGPU the "fast" reciprocal is the hardware v_rcp_f32 approximation,
+// On AMDGPU the "fast" reciprocal is the hardware v_rcp approximation.
 _CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE __clc_recip_fast(__CLC_GENTYPE x) {
 #if defined(__CLC_SCALAR) && __CLC_FPSIZE == 32
   return __builtin_amdgcn_rcpf(x);
+#elif defined(__CLC_SCALAR) && __CLC_FPSIZE == 64
+  // Hardware v_rcp_f64 seed refined with two Newton-Raphson iterations. This
+  // computes 1.0 / x without the full IEEE scaling and subnormal fixups.
+  __CLC_GENTYPE r = __builtin_amdgcn_rcp(x);
+  r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+  r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+  return r;
 #else
   return ((__CLC_GENTYPE)1.0) / x;
 #endif

_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
