yaxunl added a comment.

In D86376#2234259 <https://reviews.llvm.org/D86376#2234259>, @tra wrote:

> How much does this inlining buy you in practice? I.e. what's a typical launch
> latency before/after the patch? For CUDA, config push/pop is negligible
> compared to the cost of actually launching the kernel on the GPU. It is
> measurable if the launch is asynchronous, but queueing kernels fast, does not
> help all that much in the long run -- you eventually have to run those
> kernels on the GPU, so in most cases you're just spend a bit more time idling
> while waiting for the queued kernels to finish. To be beneficial, you'll need
> a finely balanced CPU/GPU workload and that's rather hard to achieve. Not to
> the point where the minor savings here would be meaningful. I would assume
> the situation on AMD GPUs is not that different.

`__hipPushConfiguration`/`__hipPopConfiguration` and the kernel stub can cause 40 ns of overhead, and we have requests to squeeze out any overhead in kernel launch latency.

> One side effect of this patch is that there will be no convenient way to set
> host-side breakpoint on kernel launch.
> Another will be that examining call stack will become somewhat confusing as
> the arguments passed to the kernel as written in the source code will not
> match those observed in the stack trace. I guess preserving the appearance of
> normal function calls was the reason for the split config setup/kernel
> launch in CUDA. I'd say it's still useful to have as CUDA-specific debugger
> is not always available and one must use regular gdb on CUDA apps now and
> then.

Eliminating the kernel stub does not affect debuggability negatively. At least this is true for the HIP debugger. In fact, our debugger team has intentionally requested that we eliminate any debug information for the kernel stub, so that the debugger does not confuse it with the real kernel. The kernel stub is an artificial host function for launching the kernel, not the real kernel, which is in the device binary.
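To make the terminology concrete, here is a rough host-only sketch of the two launch shapes being compared: configuration push, then a stub call that pops the configuration and launches, versus a single inlined launch. All names in it (`mockPushConfiguration`, `mockPopConfiguration`, `mockLaunchKernel`, `scale_stub`, `scale_handle`) are hypothetical stand-ins, not the real HIP runtime entry points or what the patch emits, and the grid/block sizes are arbitrary.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical launch configuration; the real runtime uses dim3 and a stream.
struct LaunchConfig {
  unsigned grid;
  unsigned block;
  std::size_t sharedMem;
};

// Mock runtime state (assumption: one pending configuration per thread).
static thread_local LaunchConfig g_pending{};
static thread_local bool g_hasPending = false;
static int g_launchCount = 0;
static LaunchConfig g_lastLaunch{};

// Hypothetical global variable standing in for the kernel's host-side symbol.
static const char scale_handle = 0;

void mockPushConfiguration(LaunchConfig cfg) {
  g_pending = cfg;
  g_hasPending = true;
}

LaunchConfig mockPopConfiguration() {
  assert(g_hasPending && "pop without a matching push");
  g_hasPending = false;
  return g_pending;
}

// Records the launch instead of talking to a GPU.
void mockLaunchKernel(const void* handle, LaunchConfig cfg, void** args) {
  (void)handle;
  (void)args;
  ++g_launchCount;
  g_lastLaunch = cfg;
}

// Artificial host function: packs the arguments, pops the configuration,
// and launches. It is not the kernel itself.
void scale_stub(float* data, float factor) {
  void* args[] = {&data, &factor};
  LaunchConfig cfg = mockPopConfiguration();
  mockLaunchKernel(&scale_handle, cfg, args);
}

// Split form: push the configuration, then call the stub.
void launchViaStub(float* data, float factor) {
  mockPushConfiguration({4, 64, 0});
  scale_stub(data, factor);
}

// Inlined form: no push/pop and no stub call, just one launch call.
void launchInlined(float* data, float factor) {
  void* args[] = {&data, &factor};
  mockLaunchKernel(&scale_handle, {4, 64, 0}, args);
}
```

Both paths end in the same launch call; the inlined one simply skips the push/pop round trip and the extra call frame, which is where the overhead mentioned above would come from.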
For the HIP debugger (rocmgdb), when the user sets a breakpoint on a kernel, it breaks on the real kernel in the device binary, and the call stack is displayed correctly. The arguments to the real kernel are not lost, since the real kernel is a real function in the device binary.

Another motivation for eliminating the kernel stub is to be able to emit a symbol with the same mangled name as the kernel as a global variable instead of a function, since we need such symbols to launch kernels by their mangled names in a C++ program. If we use the kernel stub as the symbol, we cannot use the original mangled kernel name, since our debugger does not allow that.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits