[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread John McCall via Phabricator via cfe-commits
rjmccall added a comment.

In D120566#3346533 , @arsenm wrote:

> In D120566#3346506 , @rjmccall 
> wrote:
>
>> Is there something which stops you from taking the address of a kernel and 
>> then calling it?  If not, are there actually any uses of kernels in the 
>> module that shouldn't be rewritten as uses of the clone?
>
> The actual amdgpu_kernel is uncallable and has a totally different ABI, and 
> is invoked by external driver code. From the user's device code perspective, 
> only the callable function version is meaningful.

I think you're misunderstanding what I'm asking.  I believe that in OpenCL, you 
can do `&someKernelFunction` in source code and then call that.  The rewrite in 
this patch does not handle non-call uses of the kernel function and so will 
continue to miscompile them.

>> I feel like this would be a lot easier to just fix in your LLVM pass so that 
>> you rewrite any uses of a kernel to use a clone instead before you rewrite 
>> the kernel.
>
> Then we can't ban calls to kernels (and would be pushing some kind of symbol 
> naming conflict decision into the backend) and in principle would have to 
> handle this actual call.

Okay, this is not an accurate description of what you're trying to do, and this 
is important to be precise about.  You are not "banning calls to kernels", 
which would be a novel language restriction and make you non-conformant to 
OpenCL.  You still have a language requirement to allow code to directly use 
kernel functions.  That is why this patch is modifying IR generation instead of 
emitting new errors in Sema.

What's happening here is that your target (very reasonably) requires kernels to 
have a special kernel entrypoint in order to be called from outside.  That 
entrypoint uses a very different ABI from ordinary functions, one which 
simplifies being dynamically called by the runtime, and so it is important that 
ordinary uses of the function don't accidentally resolve against that special 
entrypoint.  You therefore need two different functions for the kernel, one to 
satisfy standard uses and one to act as the kernel entrypoint.

Your current architecture is to generate code normally, which will produce 
what's roughly the standard entrypoint, and then have a backend pass break that 
down and produce a kernel entrypoint.  I can understand why you find that 
frustratingly limited, and I agree that it doesn't seem to handle standard uses 
correctly.  Something needs to change here.

Now, Clang supports many different kernel languages, all of which face very 
similar language/implementation problems.  It is therefore always informative 
to go check to see how other language implementors have tried to solve the 
problem you're facing.  So if you go and look at how CUDA is implemented in 
Clang, you will see that they have introduced a "kernel reference kind" to 
`GlobalDecl`, which allows them to distinguish between the kernel entrypoint 
and the standard entrypoint in IRGen.  You could very easily build on this in 
your OpenCL implementation so that Clang emits the standard entrypoint and then 
either your pass or IRGen itself fills in the kernel entrypoint by marshaling 
arguments and then calling (presumably in a way that forces inlining) the 
standard entrypoint.  That would also give you total control of how arguments 
are marshaled in the kernel entrypoint.

Alternatively, I think cloning the standard entrypoint so that uses of it are 
rewritten to a clone is reasonable enough.  I don't really see why doing the 
cloning in IRGen is necessary when you already have a module pass that does 
similar kinds of rewriting.  Doing the clone in IRGen also does not seem to 
move you closer to your goal of actually marshaling arguments differently.  
Most importantly, though, I believe you do need to rewrite all the uses and not 
just the direct calls.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added a comment.

In D120566#3346506 , @rjmccall wrote:

> Is there something which stops you from taking the address of a kernel and 
> then calling it?  If not, are there actually any uses of kernels in the 
> module that shouldn't be rewritten as uses of the clone?

The actual amdgpu_kernel is uncallable and has a totally different ABI, and is 
invoked by external driver code. From the user's device code perspective, only 
the callable function version is meaningful.

> I feel like this would be a lot easier to just fix in your LLVM pass so that 
> you rewrite any uses of a kernel to use a clone instead before you rewrite 
> the kernel.

Then we can't ban calls to kernels (and would be pushing some kind of symbol 
naming conflict decision into the backend) and in principle would have to 
handle this actual call.

We also don't really want these to have similar/compatible signatures where you 
can just swap out the call target. For example I want to more drastically 
change the IR used for aggregates between the two cases.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread John McCall via Phabricator via cfe-commits
rjmccall added a comment.

Is there something which stops you from taking the address of a kernel and then 
calling it?  If not, are there actually any uses of kernels in the module that 
shouldn't be rewritten as uses of the clone?

I feel like this would be a lot easier to just fix in your LLVM pass so that 
you rewrite any uses of a kernel to use a clone instead before you rewrite the 
kernel.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added inline comments.



Comment at: clang/lib/CodeGen/TargetInfo.cpp:9238
 
+static llvm::Function *getKernelClone(llvm::Function &F) {
+  llvm::Module *M = F.getParent();

I don't think we can really start with the function IR. The TargetABIInfo could 
be different from the kernel and function form (and will due to using 
byval/byref etc.)



Comment at: clang/lib/CodeGen/TargetInfo.cpp:9240
+  llvm::Module *M = F.getParent();
+  SmallString<128> MangledName("__amdgpu_");
+  MangledName.append(F.getName());

I don't think adding a prefix and suffix is a good strategy for something which 
in principle should be ABI visible. A period + suffix I think would be a better 
convention



Comment at: clang/lib/CodeGen/TargetInfo.cpp:9478-9479
+
+  CI->setCalledFunction(Clone);
+  CI->setCallingConv(llvm::CallingConv::C);
+}

This is basically just moving what the current hack does into clang. Can we 
emit calls to the function version up front?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added a comment.

In D120566#3345604 , @yaxunl wrote:

> One of my concerns is that all kernels are duplicated which may cause code 
> object size doubled.

Not really, the kernel should just be a stub that calls the real implementation 
function. In the real world this will always be inlined

> Do we need to make the clone always_inline and let the kernel call its clone 
> to avoid duplicate function bodies? Or LLVM has some pass to do that?

It's not a special case, there's no real need to put always_inline. Nobody uses 
this feature in the real world anyway, and single use functions will be inlined

> Another concern is that the duplicate non-kernel functions have actual kernel 
> ABI. Not sure if that can cause any issues.

My main question is how we have the symbol for the kernel and function coexist


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

One of my concerns is that all kernels are duplicated which may cause code 
object size doubled.

Do we need to make the clone always_inline and let the kernel call its clone to 
avoid duplicate function bodies? Or LLVM has some pass to do that?

Another concern is that the duplicate non-kernel functions have actual kernel 
ABI. Not sure if that can cause any issues.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120566/new/

https://reviews.llvm.org/D120566

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D120566: [OpenCL][AMDGPU]: Do not allow a call to kernel

2022-02-25 Thread Christudasan Devadasan via Phabricator via cfe-commits
cdevadas created this revision.
cdevadas added reviewers: rjmccall, Anastasia, yaxunl, arsenm.
Herald added subscribers: Naghasan, ldrumm, kerbowa, t-tye, tpr, dstuttard, 
jvesely, kzhuravl.
cdevadas requested review of this revision.
Herald added subscribers: cfe-commits, wdng.
Herald added a project: clang.

In OpenCL, a kernel is allowed to call other kernels as if
they are regular functions. To support it, clang emits
amdgpu_kernel calling convention for both caller and callee.
A backend pass in our downstream compiler alters such calls
by introducing regular function bodies which are clones of
the callee kernels. This implementation currently limits us
in certain ways. For instance, the restriction to not use
byref attribute for callee kernels.

To avoid such limitations, this patch brings in those
cloned functions early on and prevents clang from generating
amdgpu_kernel call sites. A new function body will be added
for each kernel in the compilation unit expecting that the
unused clones will get removed at link time.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D120566

Files:
  clang/lib/CodeGen/CodeGenModule.cpp
  clang/lib/CodeGen/TargetInfo.cpp
  clang/lib/CodeGen/TargetInfo.h
  clang/test/CodeGenOpenCL/amdgpu-kernel-calls.cl
  clang/test/CodeGenOpenCL/visibility.cl

Index: clang/test/CodeGenOpenCL/visibility.cl
===
--- clang/test/CodeGenOpenCL/visibility.cl
+++ clang/test/CodeGenOpenCL/visibility.cl
@@ -94,23 +94,6 @@
 ext_func_default();
 }
 
-// FVIS-DEFAULT: declare amdgpu_kernel void @ext_kern()
-// FVIS-PROTECTED: declare protected amdgpu_kernel void @ext_kern()
-// FVIS-HIDDEN: declare protected amdgpu_kernel void @ext_kern()
-
-// FVIS-DEFAULT: declare protected amdgpu_kernel void @ext_kern_hidden()
-// FVIS-PROTECTED: declare protected amdgpu_kernel void @ext_kern_hidden()
-// FVIS-HIDDEN: declare protected amdgpu_kernel void @ext_kern_hidden()
-
-// FVIS-DEFAULT: declare protected amdgpu_kernel void @ext_kern_protected()
-// FVIS-PROTECTED: declare protected amdgpu_kernel void @ext_kern_protected()
-// FVIS-HIDDEN: declare protected amdgpu_kernel void @ext_kern_protected()
-
-// FVIS-DEFAULT: declare amdgpu_kernel void @ext_kern_default()
-// FVIS-PROTECTED: declare amdgpu_kernel void @ext_kern_default()
-// FVIS-HIDDEN: declare amdgpu_kernel void @ext_kern_default()
-
-
 // FVIS-DEFAULT: declare void @ext_func()
 // FVIS-PROTECTED: declare protected void @ext_func()
 // FVIS-HIDDEN: declare hidden void @ext_func()
@@ -126,3 +109,21 @@
 // FVIS-DEFAULT: declare void @ext_func_default()
 // FVIS-PROTECTED: declare void @ext_func_default()
 // FVIS-HIDDEN: declare void @ext_func_default()
+
+// A kernel call will be emitted as a call to its cloned function
+// of non-kernel convention.
+// FVIS-DEFAULT: declare void @__amdgpu_ext_kern_kernel_body()
+// FVIS-PROTECTED: declare void @__amdgpu_ext_kern_kernel_body()
+// FVIS-HIDDEN: declare void @__amdgpu_ext_kern_kernel_body()
+
+// FVIS-DEFAULT: declare void @__amdgpu_ext_kern_hidden_kernel_body()
+// FVIS-PROTECTED: declare void @__amdgpu_ext_kern_hidden_kernel_body()
+// FVIS-HIDDEN: declare void @__amdgpu_ext_kern_hidden_kernel_body()
+
+// FVIS-DEFAULT: declare void @__amdgpu_ext_kern_protected_kernel_body()
+// FVIS-PROTECTED: declare void @__amdgpu_ext_kern_protected_kernel_body()
+// FVIS-HIDDEN: declare void @__amdgpu_ext_kern_protected_kernel_body()
+
+// FVIS-DEFAULT: declare void @__amdgpu_ext_kern_default_kernel_body()
+// FVIS-PROTECTED: declare void @__amdgpu_ext_kern_default_kernel_body()
+// FVIS-HIDDEN: declare void @__amdgpu_ext_kern_default_kernel_body()
Index: clang/test/CodeGenOpenCL/amdgpu-kernel-calls.cl
===
--- /dev/null
+++ clang/test/CodeGenOpenCL/amdgpu-kernel-calls.cl
@@ -0,0 +1,60 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -S -disable-llvm-passes -emit-llvm -o - %s | FileCheck %s
+
+// AMDGPU disallows kernel callsites from another kernels. For each kernel, clang codegen will introduce
+// a cloned function body with a non-kernel calling convention and amdgpu_kernel callsites will get
+// transformed to call appropriate clones.
+
+extern kernel void test_extern_kernel_callee(global int *in);
+
+// CHECK: define dso_local amdgpu_kernel void @test_kernel_callee(i32 addrspace(1)* noundef align 4 %in)
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:[[IN_ADDR:%.*]] = alloca i32 addrspace(1)*, align 8, addrspace(5)
+// CHECK-NEXT:store i32 addrspace(1)* [[IN:%.*]], i32 addrspace(1)* addrspace(5)* [[IN_ADDR]], align 8
+// CHECK-NEXT:[[TMP0:%.*]] = load i32 addrspace(1)*, i32 addrspace(1)* addrspace(5)* [[IN_ADDR]], align 8
+// CHECK-NEXT:store i32 10, i32 addrspace(1)* [[TMP0]], align 4
+// CHECK-NEXT:ret void
+//
+kernel void test_kernel_callee(global int *in) {
+  *in = (int)(10);
+}
+
+// CHECK: define dso_lo