[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan created this revision.
Herald added subscribers: mattd, gchakrabarti, asavonic, hiraditya.
Herald added a project: All.
hdelan requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, jholewinski.
Herald added projects: clang, LLVM.

This patch adds __nvvm_reflect as a clang builtin for the NVPTX backend. This
allows __nvvm_reflect to be called directly from source code to enable
conditional compilation based on compute capability and FTZ properties.
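For illustration, a minimal sketch (not part of the patch itself) of how the
builtin could be used from CUDA device code, assuming the NVVMReflect pass
later folds the calls to constants as it does for libdevice; "__CUDA_ARCH" and
"__CUDA_FTZ" are the reflect strings that pass recognizes:

  // Sketch only: with this patch, no manual declaration of __nvvm_reflect
  // is needed in device code.
  __device__ int ftz_enabled() {
    // NVVMReflect folds this to 1 when the module was built with FTZ
    // enabled, and to 0 otherwise.
    return __nvvm_reflect("__CUDA_FTZ");
  }

  __device__ int arch_times_ten() {
    // NVVMReflect folds this to the SM version times 10, e.g. 500 on
    // sm_50, so branches on it are resolved at compile time.
    return __nvvm_reflect("__CUDA_ARCH");
  }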


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/lib/Target/NVPTX/NVVMReflect.cpp
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll

Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===================================================================
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
 
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===================================================================
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
 
Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===================================================================
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,6 +40,7 @@
 #include <sstream>
 #include <string>
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
+#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"
 
 using namespace llvm;
 
@@ -119,6 +120,7 @@
   continue;
 Function *Callee = Call->getCalledFunction();
 if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
+Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
 Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
   continue;
 
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===================================================================
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
 Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;
 
 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===================================================================
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:   -disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -

[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Andrew Savonichev via Phabricator via cfe-commits
asavonic added inline comments.



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1581
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

Is there a reason to change the intrinsic?



Comment at: llvm/lib/Target/NVPTX/NVVMReflect.cpp:123
 if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
+Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
 Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))

This seems to be equivalent to the `Callee->getIntrinsicID()` check below. Is
there a case where the intrinsic is not recognized by `getIntrinsicID`?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1581
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

asavonic wrote:
> Is there a reason to change the intrinsic?
The clang builtin doesn't link up properly with the llvm intrinsic if
`llvm_anyptr_ty` is used: `llvm_anyptr_ty` causes clang to look for a general
`arg` parameter instead of a `Pointer`, which results in a segfault. Changing
to `llvm_ptr_ty` fixes the issue.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Andrew Savonichev via Phabricator via cfe-commits
asavonic added inline comments.



Comment at: llvm/include/llvm/IR/IntrinsicsNVVM.td:1581
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;

hdelan wrote:
> asavonic wrote:
> > Is there a reason to change the intrinsic?
> The clang builtin doesn't link up properly with the llvm intrinsic if
> `llvm_anyptr_ty` is used: `llvm_anyptr_ty` causes clang to look for a general
> `arg` parameter instead of a `Pointer`, which results in a segfault. Changing
> to `llvm_ptr_ty` fixes the issue.
This sounds like a bug in clang. If clang's default lowering for builtins
doesn't play well with `llvm_anyptr_ty`, it should be possible to implement it
with custom lowering code (in `CodeGenFunction::EmitNVPTXBuiltinExpr`) without
changing the intrinsic. Some intrinsics with anyptr arguments are already
implemented with custom lowering (LDG, some atomics).
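For reference, a rough sketch of what such custom lowering might look like; a
hypothetical fragment for the switch in
`CodeGenFunction::EmitNVPTXBuiltinExpr`, not code from this patch, assuming
the overloaded anyptr intrinsic is instantiated from the argument's actual
pointer type:

  case NVPTX::BI__nvvm_reflect: {
    Value *Ptr = EmitScalarExpr(E->getArg(0));
    // Instantiate llvm.nvvm.reflect with the argument's pointer type, so
    // pointers in any address space are accepted without a cast.
    llvm::Function *F =
        CGM.getIntrinsic(Intrinsic::nvvm_reflect, Ptr->getType());
    return Builder.CreateCall(F, Ptr);
  }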

Changing the intrinsic is problematic for backward compatibility. Overloaded
intrinsics from old IR will not be recognized, and pointers in non-zero address
spaces cannot be used as arguments directly (they must be addrspacecast to the
generic AS).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan updated this revision to Diff 472310.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  llvm/lib/Target/NVPTX/NVVMReflect.cpp


Index: llvm/lib/Target/NVPTX/NVVMReflect.cpp
===================================================================
--- llvm/lib/Target/NVPTX/NVVMReflect.cpp
+++ llvm/lib/Target/NVPTX/NVVMReflect.cpp
@@ -40,7 +40,6 @@
 #include <sstream>
 #include <string>
 #define NVVM_REFLECT_FUNCTION "__nvvm_reflect"
-#define NVVM_REFLECT_LLVM_INTRINSIC_NAME "llvm.nvvm.reflect"
 
 using namespace llvm;
 
@@ -120,7 +119,6 @@
   continue;
 Function *Callee = Call->getCalledFunction();
 if (!Callee || (Callee->getName() != NVVM_REFLECT_FUNCTION &&
-Callee->getName() != NVVM_REFLECT_LLVM_INTRINSIC_NAME &&
 Callee->getIntrinsicID() != Intrinsic::nvvm_reflect))
   continue;
 




[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan updated this revision to Diff 472311.
hdelan added a comment.

Removing redundant check


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154

Files:
  clang/include/clang/Basic/BuiltinsNVPTX.def
  clang/test/CodeGenCUDA/nvvm-reflect.cu
  llvm/include/llvm/IR/IntrinsicsNVVM.td
  llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
  llvm/test/CodeGen/NVPTX/nvvm-reflect.ll

Index: llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
===================================================================
--- llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(i8*)
+declare i32 @llvm.nvvm.reflect(i8*)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)* getelementptr inbounds ([11 x i8], [11 x i8] addrspace(4)* @str, i32 0, i32 0))
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(i8* %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(i8* %ptr)
   ret i32 %reflect
 }
 
Index: llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
===
--- llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
+++ llvm/test/CodeGen/NVPTX/nvvm-reflect-opaque.ll
@@ -41,7 +41,7 @@
   ret float %ret
 }
 
-declare i32 @llvm.nvvm.reflect.p0i8(ptr)
+declare i32 @llvm.nvvm.reflect(ptr)
 
 ; CHECK-LABEL: define i32 @intrinsic
 define i32 @intrinsic() {
@@ -49,7 +49,7 @@
 ; USE_FTZ_0: ret i32 0
 ; USE_FTZ_1: ret i32 1
   %ptr = tail call ptr @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(ptr addrspace(4) @str)
-  %reflect = tail call i32 @llvm.nvvm.reflect.p0i8(ptr %ptr)
+  %reflect = tail call i32 @llvm.nvvm.reflect(ptr %ptr)
   ret i32 %reflect
 }
 
Index: llvm/include/llvm/IR/IntrinsicsNVVM.td
===================================================================
--- llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1578,7 +1578,8 @@
 Intrinsic<[], [llvm_anyptr_ty], [], "llvm.nvvm.compiler.warn">;
 
 def int_nvvm_reflect :
-  Intrinsic<[llvm_i32_ty], [llvm_anyptr_ty], [IntrNoMem], "llvm.nvvm.reflect">;
+  Intrinsic<[llvm_i32_ty], [llvm_ptr_ty], [IntrNoMem], "llvm.nvvm.reflect">,
+  ClangBuiltin<"__nvvm_reflect">;
 
 // isspacep.{const, global, local, shared}
 def int_nvvm_isspacep_const
Index: clang/test/CodeGenCUDA/nvvm-reflect.cu
===================================================================
--- /dev/null
+++ clang/test/CodeGenCUDA/nvvm-reflect.cu
@@ -0,0 +1,81 @@
+// REQUIRES: nvptx-registered-target
+
+// Checking to see that __nvvm_reflect resolves to the correct llvm.nvvm.reflect
+// intrinsic
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -disable-llvm-passes -S -emit-llvm -x c++ %s -o - | FileCheck %s --check-prefix=NO_NVVM_REFLECT_PASS
+
+// Prepare bitcode file to link with
+// RUN: %clang_cc1 -triple nvptx-unknown-cuda -emit-llvm-bc \
+// RUN:   -disable-llvm-passes -o %t.bc %s
+
+// Checking to see if the correct values are substituted for the nvvm_reflect
+// call when llvm passes are enabled.
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_52 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_2
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_53 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_3
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_60 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_4
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_61 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_5
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_62 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_6
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_70 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_7
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN:   sm_72 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_8
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \
+// RUN: 

[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: yaxunl.
tra added a comment.

I don't think it's a good idea. `__nvvm_reflect` is a hack that we should not 
propagate. The only code that currently relies on it is NVIDIA's libdevice 
bitcode and I'd prefer to keep it that way.

Can you elaborate on what motivates this change?

We already have a way to do conditional compilation based on the GPU 
architecture.
If you need the code to know whether FTZ mode is enabled or not, that should be 
exposed on its own. I'm not convinced that it would be a great idea, either. 
LLVM has a lot of ways to control FP code generation (that I'm mostly 
unqualified to comment on) but those options are fine-grained. `__nvvm_reflect`
would only affect things module-wide and would not always tell you whether FTZ
instruction variants would be used. E.g.
llvm/test/CodeGen/NVPTX/math-intrins.ll shows that `ftz` can be controlled per
function via the `"denormal-fp-math-f32" = "preserve-sign"` attribute.

Another aspect of this is that the concept of `ftz` is not unique to NVPTX. I 
believe AMDGPU has `ftz` instruction variants, too. @yaxunl - FYI.

If you need to control which FP instruction variant we generate, you should
probably use `#pragma clang fp ...` for that and *tell* the compiler what you need.
https://clang.llvm.org/docs/LanguageExtensions.html#extensions-to-specify-floating-point-flags
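For instance, a small hedged example of that fine-grained approach (the pragma
is a real clang extension; the function itself is made up):

  // Allow FP contraction just for this computation instead of keying
  // behavior off a module-wide __nvvm_reflect query.
  __device__ float fma_sum(float a, float b, float c) {
  #pragma clang fp contract(fast) // permit fusing the multiply and add
    return a * b + c;
  }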

I do not think clang currently exposes fine-grained details about selected FP 
modes. We do have `__FAST_MATH__`, `__FINITE_MATH_ONLY__`, and 
`__FAST_RELAXED_MATH__`, but that seems to be about it.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-03 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan marked an inline comment as not done.
hdelan added a comment.

In DPC++ for CUDA we use libclc as a wrapper around CUDA SDK's libdevice. Like
libdevice, we want to precompile libclc to bitcode for the CUDA backend without
specializing for a particular arch, so that we can call different __nv funcs
based on the arch. For this reason we use the `__nvvm_reflect` LLVM intrinsic.

Without a clang builtin a number of workarounds are necessary to get the 
desired behaviour from `__nvvm_reflect`.

1. We must declare `__nvvm_reflect` in source code everywhere it is used.
2. Some addrspace casting is necessary in OpenCL, since strings are in the
`__constant` address space. We must use an IR wrapper function to ensure the
correct behaviour. See
https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/reflect.ll

We agree that `__nvvm_reflect` is not a perfect solution, but it is something 
we must use because of libdevice. Adding the clang builtin would enable us to 
write libdevice-like libraries without having to resort to the methods 
mentioned above.
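Concretely, workaround (1) looks like this today (a sketch in CUDA-like code;
with the proposed builtin the declaration line goes away):

  // Must be repeated in every translation unit that uses __nvvm_reflect.
  extern "C" __device__ int __nvvm_reflect(const char *);

  __device__ bool is_sm70_or_newer() {
    // Resolved to a constant by the NVVMReflect pass.
    return __nvvm_reflect("__CUDA_ARCH") >= 700;
  }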


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-03 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D137154#3904810, @hdelan wrote:

> In DPC++ for CUDA we use libclc as a wrapper around CUDA SDK's libdevice.
> Like libdevice, we want to precompile libclc to bitcode for the CUDA backend
> without specializing for a particular arch, so that we can call different
> __nv funcs based on the arch. For this reason we use the `__nvvm_reflect`
> LLVM intrinsic.
>
> Without a clang builtin a number of workarounds are necessary to get the 
> desired behaviour from `__nvvm_reflect`.
>
> 1. We must declare `__nvvm_reflect` in source code everywhere it is used.
> 2. Some addrspace casting is necessary in OpenCL, since strings are in the
> `__constant` address space. We must use an IR wrapper function to ensure the
> correct behaviour. See
> https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/reflect.ll
>
> We agree that `__nvvm_reflect` is not a perfect solution, but it is something 
> we must use because of libdevice. Adding the clang builtin would enable us to 
> write libdevice-like libraries without having to resort to the methods 
> mentioned above.

LLVM does not maintain bitcode backward compatibility between major releases. 
How do you make sure the clang/LLVM you are using is compatible with libdevice?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-03 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added a comment.

In D137154#3905711, @yaxunl wrote:

> LLVM does not maintain bitcode backward compatibility between major releases. 
> How do you make sure the clang/LLVM you are using is compatible with 
> libdevice?

Yes it does? We support bitcode auto upgrade as far back as 3.0


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-03 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:827
 
+BUILTIN(__nvvm_reflect, "icC*", "nc")
+

Do we actually need this patch at all?

`extern "C" int __nvvm_reflect(const char *);` appears to work just fine 
already:
https://godbolt.org/z/caGenxn1e





Comment at: clang/test/CodeGenCUDA/nvvm-reflect.cu:15
+// RUN:   sm_50 -S -o /dev/null %s -mllvm -print-after-all 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=ARCH_REFLECT_1
+// RUN: %clang_cc1 -fcuda-is-device -triple nvptx64-nvidia-cuda -target-cpu \

Nit. I'd use the suffix matching the GPU variant we're targeting.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-04 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added inline comments.



Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:827
 
+BUILTIN(__nvvm_reflect, "icC*", "nc")
+

tra wrote:
> Do we actually need this patch at all?
> 
> `extern "C" int __nvvm_reflect(const char *);` appears to work just fine 
> already:
> https://godbolt.org/z/caGenxn1e
> 
> 
It does work fine unless we are using OpenCL, which requires the addrspace
casting as mentioned above.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-04 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/include/clang/Basic/BuiltinsNVPTX.def:827
 
+BUILTIN(__nvvm_reflect, "icC*", "nc")
+

hdelan wrote:
> tra wrote:
> > Do we actually need this patch at all?
> > 
> > `extern "C" int __nvvm_reflect(const char *);` appears to work just fine 
> > already:
> > https://godbolt.org/z/caGenxn1e
> > 
> > 
> It does work fine unless we are using OpenCL, which requires the addrspace
> casting as mentioned above.
Here's the way the situation looks to me:
* __nvvm_reflect is a technical debt (IMO) that exists because NVIDIA happened 
to use it in libdevice bitcode so we had to make LLVM deal with it.
* It's not intended to be used as a public API outside of libdevice.
* you want to use it in a library. Just one library, AFAICT.
* there is a way to use it, though it does take some casts (BTW, we're only 
checking the name of the function, so you could declare it with a `__constant` 
argument : https://godbolt.org/z/s9EvGfPhe)
* this CL intends to make `__nvvm_reflect` to be available in clang for wider 
use, which will make removing it in the future much harder.

On the balance, I still do not think that it's worth exposing `__nvvm_reflect` 
as a clang builtin.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-09 Thread Hugh Delaney via Phabricator via cfe-commits
hdelan added a comment.

Thanks for the feedback. Instead of adding `__nvvm_reflect` as a clang builtin,
would it be acceptable if I modified the NVVMReflect pass so that it works with
addrspace casting as well? This would allow us to use `__nvvm_reflect` in OpenCL.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-09 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: jhuber6.
tra added a comment.

In D137154#3917692, @hdelan wrote:

> Thanks for the feedback. Instead of adding `__nvvm_reflect` as a clang builtin,
> would it be acceptable if I modified the NVVMReflect pass

That would be less problematic, but I'm still concerned that it would tacitly 
endorse the use of `__nvvm_reflect` by LLVM users.

> so that it works with addrspace casting as well? This would allow us to use
> `__nvvm_reflect` in OpenCL

Relaxing argument type checks on `__nvvm_reflect` function would be fine with 
me.

That said,...

TBH, I'm still not quite convinced that compiler changes are the right solution
for making it possible for *one* library to rely on something that was never
intended to be exposed to compiler users.

Perhaps we should take a step back, figure out the fundamental problem you need 
to solve (as opposed to figuring out how to make a tactical hack work) and then 
figure out a more principled solution.

> In DPC++ for CUDA we use libclc as a wrapper around CUDA SDK's libdevice. 
> Like libdevice we want to precompile libclc to bc for the CUDA backend 
> without specializing for a particular arch, so that we can call different 
> __nv funcs based on the arch. For this reason we use the __nvvm_reflect llvm 
> intrinsic.

For starters, libdevice by itself is something that's not quite intended for 
the end user. It was a rather poor stop-gap solution to address the fact that 
there used to be no linking phase for GPU binaries and no 'standard' math 
library the code could rely on. The library itself does not have anything 
particularly interesting in it. Its major advantage is that it exists, while we 
don't have our own GPU-side libm yet. We do want to get rid of libdevice and 
replace it with an open-source math library of our own. With the recent 
improvements in offloading support in clang driver we're getting closer to 
making it possible.

As for the code specialization, why not build for individual GPUs? To me it
looks like this use case is a good match for the "new-driver" offloading that's
been recently implemented in clang. It allows compiling/linking GPU-side code
and that should obviate the need for shipping bitcode and relying on
`__nvvm_reflect` for specialization.
The downside is that it's a recent feature, so it would not be available in
older clang versions. @jhuber6: I'm also not sure if OpenCL is supported by
the new driver.

With the new driver, you in theory should be able to compile the source with
`--offload-arch=A --offload-arch=B` and it would produce an object file
containing GPU-specific bitcode or object code, which can then be transparently
linked into the final executable, where clang would perform the final linking
of the GPU binaries as well.

I realize that it may be hard or not feasible for your project right now. I'm 
OK with allowing limited nvvm_reflect use for the time being, but please do 
consider making things work w/o it or libdevice, if possible.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-09 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D137154#3917752, @tra wrote:

> As for the code specialization, why not build for individual GPUs? To me it
> looks like this use case is a good match for the "new-driver" offloading
> that's been recently implemented in clang. It allows compiling/linking
> GPU-side code and that should obviate the need for shipping bitcode and
> relying on `__nvvm_reflect` for specialization.

Dealing with the relative compatibility of GPU targets is a never-ending
endeavor, so I'm a proponent of just compiling the source for every single
architecture; this is what we do with the `libomptarget` device libraries and
is how the future `libm` and `libc` libraries will work.

> The downside is that it's a recent feature, so it would not be available in 
> older clang versions. @jhuber6: I'm also not sure if OpenCL is supported by
> the new driver.

It's not currently; I could put some cycles into that if needed. Also, I'm
wondering if it might be time to start an RFC for moving CUDA to the new driver
by default. The one downside would be losing compatibility with CUDA's method
of RDC compilation, but I don't think that was ever a goal.

> With the new driver, you in theory should be able to compile the source with
> `--offload-arch=A --offload-arch=B` and it would produce an object file
> containing GPU-specific bitcode or object code, which can then be
> transparently linked into the final executable, where clang would perform
> the final linking of the GPU binaries as well.

Part of me wonders if we should supply `--offload-arch=all` if people start
doing this a lot.

> I realize that it may be hard or not feasible for your project right now. I'm 
> OK with allowing limited nvvm_reflect use for the time being, but please do 
> consider making things work w/o it or libdevice, if possible.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-10 Thread Alexey Bader via Phabricator via cfe-commits
bader added a comment.

Is binary size a concern here? NVIDIA, AMD and Intel GPUs already have ~20
different architectures each, so if I want my app/library to run on any GPU
from these vendors (which is a quite reasonable expectation), I'll need to
have/distribute ~60 different binaries. libdevice, libm, libc are quite small,
but other apps (e.g. ML frameworks) might be quite large, so distributed
binary size is important to consider.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-11 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

In D137154#3918949, @bader wrote:

> Is binary size a concern here? NVIDIA, AMD and Intel GPUs already have ~20
> different architectures each, so if I want my app/library to run on any GPU
> from these vendors (which is a quite reasonable expectation), I'll need to
> have/distribute ~60 different binaries. libdevice, libm, libc are quite
> small, but other apps (e.g. ML frameworks) might be quite large, so
> distributed binary size is important to consider.

Yes, this probably would become untenable with really large sizes. Compression
is always an option, considering that these would be highly similar; we'd just
need some way to check whether the data stored in the `LLVM_OFFLOADING` section
is in some compressed format and extract it to a new buffer before doing
linking. A more esoteric option would be to make a tool that takes in the
LLVM IR generated for each architecture and tries to unify it, adding branches
where they differ and guarding these with some flags that make the machine
translator ignore them. Generally I don't think that manually searching for
compatibility is a good long-term solution for libraries, so we could look into
alternative means (it's required for executables, however).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154



[PATCH] D137154: Adding nvvm_reflect clang builtin

2022-11-11 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

> Yes, this probably would become untenable with really large sizes.

I think that horse left the barn long ago. The vast majority of NVIDIA GPU
cycles these days is spent in libraries shipped by NVIDIA, and those are huge
(O(1GB)), numerous (cuDNN, TensorRT, cuBLAS, cuSPARSE, cuFFT), and contain
binaries for all major GPU variants NVIDIA supports (sm3x, 5x, 6x, 7x, 8x, 9x).
AMD's ROCm stack is comparable in size, too, I think.
Yes, it does cause issues when one links in everything statically. E.g. we
sometimes end up causing overflows of 32-bit signed ELF relocations, but
typically users do link with shared libraries and avoid that particular issue.

Distributing large binaries does end up being a challenge, mostly due to
storage and transfer costs. TensorFlow had been forced to split the GPU support
bits into a separate package and limit the set of GPUs the official packages
are compiled for. I think we also considered per-GPU variant packages, but
that would multiply storage costs.

However, most users and libraries are either nowhere near cuDNN/ROCm in terms
of size, in which case the size increase will likely be manageable, or already
have to deal with size issues, in which case the increase will be insignificant
compared to their existing size.
JIT would be another option, but it has its own challenges at that scale.

Overall I think there are enough existing use cases to consider compiling for
all/most supported GPUs to be a practical solution, even for large projects. It
does not mean it will work for everyone, but I think specifically for `libclc`
compiling for all GPUs would be the right thing to do.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D137154/new/

https://reviews.llvm.org/D137154
