[PATCH] D86376: [HIP] Improve kernel launching latency

Yaxun Liu via Phabricator via cfe-commits Mon, 24 Aug 2020 14:52:03 -0700

yaxunl added a comment.



In D86376#2234547 <https://reviews.llvm.org/D86376#2234547>, @tra wrote:

> I'm OK with how the patch is implemented.
> I'm still on the fence regarding whether it should be implemented.
>
> In D86376#2234458 <https://reviews.llvm.org/D86376#2234458>, @yaxunl wrote:
>
>> `__hipPushConfiguration/__hipPopConfiguration' and kernel stub can cause 40 
>> ns overhead, whereas we have requests to squeeze any overhead in kernel 
>> launching latency.
>
> That's about the same as 1 cache miss. I'm willing to bet that it will be 
> lost in the noise. Are there any real world benchmarks where it makes a 
> difference?
> Are those requests driven by a specific use case? Not all requests (even well 
> intentioned ones) are worth implementing. 
> This patch appears to be somewhere in the gray area to me. My prior 
> experience with CUDA suggests that it will make little to no difference. On 
> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
> specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 
us. The typical kernel launching latency is about 500us, therefore the 
improvement is around 10%.

>>> One side effect of this patch is that there will be no convenient way to 
>>> set host-side breakpoint on kernel launch.
>>> Another will be that examining call stack will become somewhat confusing as 
>>> the arguments passed to the kernel as written in the source code will not 
>>> match those observed in the stack trace. I guess preserving the appearance 
>>> of normal function calls was the reason for the split  config setup/kernel 
>>> launch in CUDA.  I'd say it's still useful to have as CUDA-specific 
>>> debugger is not always available and one must use regular gdb on CUDA apps 
>>> now and then.
>>
>> Eliminating kernel stub does not affect debugability negatively. At least 
>> this is true for HIP debugger. Actually our debugger team intentionally 
>> requests to eliminate any debug information for the kernel stub so that it 
>> will not confuse the debugger with the real kernel. This is because the 
>> kernel stub is an artificial function for launching the kernel, not the real 
>> kernel which is in device binary. For HIP debugger (rocmgdb), when the user 
>> set break point on a kernel, it will break on the real kernel in device 
>> binary, and the call stack are displayed correctly. The arguments to the 
>> real kernel are not lost, since the real kernel is a real function in device 
>> binary.
>
> You appear to assume debuggability with HIP-aware debugger. That part I'm not 
> particularly concerned about as I assume that it will be tested on AMD's side.
> I was mostly concerned about debuggability with the ordinary gdb. Imagine 
> someone having to debug a TF app they've got somewhere. The end user may not 
> even have HIP tools installed. It would be useful to be able to debug until 
> the point where control is passed to the GPU. The patch will likely have a 
> minor, but still negative impact on that.
>
> I guess one should still be able to set a breakpoint using the `file:line 
> number`. If you could verify that it still works with gdb, that would be a 
> reasonable workaround. I think we still need to have some way to set a 
> breakpoint on the kernel launch site (I think it should still work) and on 
> the kernel entry.

To run HIP applications, users need to install ROCm, which includes rocgdb. A 
debugger without device code debugging capability has little use with HIP 
applications therefore I would expect users to always use rocgdb to debug HIP 
program. Also, since clang already removed all debug information for kernel 
stub, gdb cannot break on kernel stub any way.

>> Another motivation for eliminating kernel stub is to be able to emit a 
>> symbol with the same mangled name as a kernel as a global variable instead 
>> of a function. Since we need such symbols to be able to launch kernels with 
>> mangled name in a C++ program. If we use kernel stub as the symbol, we 
>> cannot use the original mangled kernel name since our debugger does not 
>> allow that.
>
> Is eliminating the host-side stub the goal, or just a coincidental 
> side-effect? I.e. if it's something you *need* to do, then the discussion 
> about minor performance gain becomes rather irrelevant and we should weigh 
> 'improvements in HIP debugging' vs 'regression in host-only debugging' 
> instead.

I would like to say the motivation of this change is two folds: 1. improve 
latency 2. interoperability with C++ programs.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D86376: [HIP] Improve kernel launching latency

Reply via email to