yaxunl added a comment.
In D86376#2234547 <https://reviews.llvm.org/D86376#2234547>, @tra wrote: > I'm OK with how the patch is implemented. > I'm still on the fence regarding whether it should be implemented. > > In D86376#2234458 <https://reviews.llvm.org/D86376#2234458>, @yaxunl wrote: > >> `__hipPushConfiguration/__hipPopConfiguration' and kernel stub can cause 40 >> ns overhead, whereas we have requests to squeeze any overhead in kernel >> launching latency. > > That's about the same as 1 cache miss. I'm willing to bet that it will be > lost in the noise. Are there any real world benchmarks where it makes a > difference? > Are those requests driven by a specific use case? Not all requests (even well > intentioned ones) are worth implementing. > This patch appears to be somewhere in the gray area to me. My prior > experience with CUDA suggests that it will make little to no difference. On > the other hand, AMD GPUs may be different enough to prove me wrong. Without > specific evidence, I still can't tell what's the case here. Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 us. The typical kernel launching latency is about 500us, therefore the improvement is around 10%. >>> One side effect of this patch is that there will be no convenient way to >>> set host-side breakpoint on kernel launch. >>> Another will be that examining call stack will become somewhat confusing as >>> the arguments passed to the kernel as written in the source code will not >>> match those observed in the stack trace. I guess preserving the appearance >>> of normal function calls was the reason for the split config setup/kernel >>> launch in CUDA. I'd say it's still useful to have as CUDA-specific >>> debugger is not always available and one must use regular gdb on CUDA apps >>> now and then. >> >> Eliminating kernel stub does not affect debugability negatively. At least >> this is true for HIP debugger. Actually our debugger team intentionally >> requests to eliminate any debug information for the kernel stub so that it >> will not confuse the debugger with the real kernel. This is because the >> kernel stub is an artificial function for launching the kernel, not the real >> kernel which is in device binary. For HIP debugger (rocmgdb), when the user >> set break point on a kernel, it will break on the real kernel in device >> binary, and the call stack are displayed correctly. The arguments to the >> real kernel are not lost, since the real kernel is a real function in device >> binary. > > You appear to assume debuggability with HIP-aware debugger. That part I'm not > particularly concerned about as I assume that it will be tested on AMD's side. > I was mostly concerned about debuggability with the ordinary gdb. Imagine > someone having to debug a TF app they've got somewhere. The end user may not > even have HIP tools installed. It would be useful to be able to debug until > the point where control is passed to the GPU. The patch will likely have a > minor, but still negative impact on that. > > I guess one should still be able to set a breakpoint using the `file:line > number`. If you could verify that it still works with gdb, that would be a > reasonable workaround. I think we still need to have some way to set a > breakpoint on the kernel launch site (I think it should still work) and on > the kernel entry. To run HIP applications, users need to install ROCm, which includes rocgdb. A debugger without device code debugging capability has little use with HIP applications therefore I would expect users to always use rocgdb to debug HIP program. Also, since clang already removed all debug information for kernel stub, gdb cannot break on kernel stub any way. >> Another motivation for eliminating kernel stub is to be able to emit a >> symbol with the same mangled name as a kernel as a global variable instead >> of a function. Since we need such symbols to be able to launch kernels with >> mangled name in a C++ program. If we use kernel stub as the symbol, we >> cannot use the original mangled kernel name since our debugger does not >> allow that. > > Is eliminating the host-side stub the goal, or just a coincidental > side-effect? I.e. if it's something you *need* to do, then the discussion > about minor performance gain becomes rather irrelevant and we should weigh > 'improvements in HIP debugging' vs 'regression in host-only debugging' > instead. I would like to say the motivation of this change is two folds: 1. improve latency 2. interoperability with C++ programs. CHANGES SINCE LAST ACTION https://reviews.llvm.org/D86376/new/ https://reviews.llvm.org/D86376 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits