yaxunl added a comment.

In D86376#2234259 <https://reviews.llvm.org/D86376#2234259>, @tra wrote:

> How much does this inlining buy you in practice? I.e. what's a typical launch 
> latency before/after the patch? For CUDA, config push/pop is negligible 
> compared to the cost of actually launching the kernel on the GPU. It is 
> measurable if the launch is asynchronous, but queueing kernels fast does not 
> help all that much in the long run -- you eventually have to run those 
> kernels on the GPU, so in most cases you just spend a bit more time idling 
> while waiting for the queued kernels to finish. To be beneficial, you'll need 
> a finely balanced CPU/GPU workload and that's rather hard to achieve. Not to 
> the point where the minor savings here would be meaningful. I would assume 
> the situation on AMD GPUs is not that different.

`__hipPushConfiguration`/`__hipPopConfiguration` and the kernel stub can add 
40 ns of overhead per launch, and we have requests to squeeze out any overhead 
in kernel launch latency.
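
To make the two launch paths concrete, here is a minimal host-only sketch in plain C++. It is not the real HIP runtime or the code in this patch: `mock_push_configuration`, `mock_pop_configuration`, `mock_launch_kernel`, `scale_stub` and the `LaunchConfig` struct are all hypothetical stand-ins, and the grid/block values are arbitrary. It only shows the extra call and the stack push/pop that the stub-based path performs before reaching the launch call, compared with launching directly from the call site.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for a launch configuration; not the real HIP type.
struct LaunchConfig {
    unsigned grid;
    unsigned block;
};

static std::vector<LaunchConfig> config_stack;  // mock configuration stack
static int push_count = 0;
static int pop_count = 0;
static int launch_count = 0;

// Mock of the configuration push emitted at the kernel call site.
void mock_push_configuration(LaunchConfig cfg) {
    config_stack.push_back(cfg);
    ++push_count;
}

// Mock of the configuration pop performed inside the host-side stub.
LaunchConfig mock_pop_configuration() {
    LaunchConfig cfg = config_stack.back();
    config_stack.pop_back();
    ++pop_count;
    return cfg;
}

// Mock of the runtime entry point that would enqueue the kernel.
void mock_launch_kernel(const char* name, LaunchConfig cfg, void** args) {
    (void)args;  // a real runtime would forward these to the device
    ++launch_count;
    std::printf("launch %s grid=%u block=%u\n", name, cfg.grid, cfg.block);
}

// Stub-based path, step 2: the stub pops the configuration and launches.
void scale_stub(float* data, float factor) {
    void* args[] = {&data, &factor};
    LaunchConfig cfg = mock_pop_configuration();
    mock_launch_kernel("scale", cfg, args);
}

// Stub-based path, step 1: the call site pushes the configuration,
// then calls the stub like an ordinary function.
void launch_via_stub(float* data, float factor) {
    mock_push_configuration({4, 256});
    scale_stub(data, factor);
}

// Inlined path: the call site packs the arguments and launches directly,
// with no configuration push/pop and no stub call.
void launch_inlined(float* data, float factor) {
    void* args[] = {&data, &factor};
    mock_launch_kernel("scale", {4, 256}, args);
}
```

Calling `launch_via_stub(buf, 2.0f)` goes through one push, one pop and one stub call before the launch, while `launch_inlined(buf, 2.0f)` reaches the launch call directly. The sketch does not measure anything; the 40 ns figure above comes from the real runtime, not from this mock.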

> One side effect of this patch is that there will be no convenient way to set 
> a host-side breakpoint on kernel launch.
> Another will be that examining the call stack will become somewhat confusing 
> as the arguments passed to the kernel as written in the source code will not 
> match those observed in the stack trace. I guess preserving the appearance of 
> normal function calls was the reason for the split config setup/kernel 
> launch in CUDA. I'd say it's still useful to have as a CUDA-specific debugger 
> is not always available and one must use regular gdb on CUDA apps now and 
> then.

Eliminating the kernel stub does not affect debuggability negatively, at least 
for the HIP debugger. In fact, our debugger team has specifically asked us to 
eliminate any debug information for the kernel stub so that the debugger does 
not confuse it with the real kernel. The kernel stub is an artificial function 
for launching the kernel, not the real kernel, which lives in the device 
binary. With the HIP debugger (rocmgdb), when the user sets a breakpoint on a 
kernel, it breaks on the real kernel in the device binary, and the call stack 
is displayed correctly. The arguments to the real kernel are not lost, since 
the real kernel is a real function in the device binary.

Another motivation for eliminating the kernel stub is to be able to emit the 
symbol that has the same mangled name as a kernel as a global variable instead 
of a function. We need such symbols in order to launch kernels by mangled name 
in a C++ program. If we use the kernel stub as that symbol, we cannot give it 
the original mangled kernel name, since our debugger does not allow that.
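
As a rough illustration of using a data symbol rather than a stub function as the host-side handle for a kernel, here is a host-only C++ sketch. Everything in it is an assumption made for illustration: `KernelHandle`, `scale_handle`, `mock_register_kernel` and `mock_resolve_kernel` are hypothetical names, the registry is an ordinary `std::map`, and the handle is a normal C++ variable rather than a symbol emitted under the kernel's mangled name. It only shows that a launch path needs a unique host address to look up the device kernel, and that a variable's address serves that purpose as well as a function's.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry mapping a host-side handle address to the name of
// the kernel in the device binary. Not the real HIP runtime data structure.
static std::map<const void*, std::string>& kernel_registry() {
    static std::map<const void*, std::string> registry;
    return registry;
}

// Mock registration step: associate a host handle with a device kernel name.
void mock_register_kernel(const void* host_handle, const char* device_name) {
    kernel_registry()[host_handle] = device_name;
}

// Mock lookup used by a launch path: returns the device kernel name,
// or an empty string if the handle was never registered.
std::string mock_resolve_kernel(const void* host_handle) {
    auto it = kernel_registry().find(host_handle);
    return it == kernel_registry().end() ? std::string() : it->second;
}

// The handle is plain data, not a function. Only its address matters.
struct KernelHandle {
    char unused;
};

// Hypothetical handle for a kernel declared as `void scale(float*, float)`.
KernelHandle scale_handle;
```

For example, after `mock_register_kernel(&scale_handle, "_Z5scalePff")`, a call to `mock_resolve_kernel(&scale_handle)` returns `"_Z5scalePff"`, the Itanium-mangled name of `scale(float*, float)`, while an unregistered address resolves to an empty string.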


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376
