[PATCH] D86376: [HIP] Improve kernel launching latency

Artem Belevich via Phabricator via cfe-commits Mon, 24 Aug 2020 13:40:45 -0700

tra added a comment.

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

In D86376#2234458 <https://reviews.llvm.org/D86376#2234458>, @yaxunl wrote:

> `__hipPushConfiguration/__hipPopConfiguration' and kernel stub can cause 40 
> ns overhead, whereas we have requests to squeeze any overhead in kernel 
> launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost 
in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well 
intentioned ones) are worth implementing. 
This patch appears to be somewhere in the gray area to me. My prior experience 
with CUDA suggests that it will make little to no difference. On the other 
hand, AMD GPUs may be different enough to prove me wrong. Without specific 
evidence, I still can't tell what's the case here.

>> One side effect of this patch is that there will be no convenient way to set 
>> host-side breakpoint on kernel launch.
>> Another will be that examining call stack will become somewhat confusing as 
>> the arguments passed to the kernel as written in the source code will not 
>> match those observed in the stack trace. I guess preserving the appearance 
>> of normal function calls was the reason for the split  config setup/kernel 
>> launch in CUDA.  I'd say it's still useful to have as CUDA-specific debugger 
>> is not always available and one must use regular gdb on CUDA apps now and 
>> then.
>
> Eliminating kernel stub does not affect debugability negatively. At least 
> this is true for HIP debugger. Actually our debugger team intentionally 
> requests to eliminate any debug information for the kernel stub so that it 
> will not confuse the debugger with the real kernel. This is because the 
> kernel stub is an artificial function for launching the kernel, not the real 
> kernel which is in device binary. For HIP debugger (rocmgdb), when the user 
> set break point on a kernel, it will break on the real kernel in device 
> binary, and the call stack are displayed correctly. The arguments to the real 
> kernel are not lost, since the real kernel is a real function in device 
> binary.

You appear to assume debuggability with HIP-aware debugger. That part I'm not 
particularly concerned about as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with the ordinary gdb. Imagine 
someone having to debug a TF app they've got somewhere. The end user may not 
even have HIP tools installed. It would be useful to be able to debug until the 
point where control is passed to the GPU. The patch will likely have a minor, 
but still negative impact on that.

I guess one should still be able to set a breakpoint using the `file:line 
number`. If you could verify that it still works with gdb, that would be a 
reasonable workaround. I think we still need to have some way to set a 
breakpoint on the kernel launch site (I think it should still work) and on the 
kernel entry.

So, we have a trade-off of minor performance gain vs a minor debuggability 
regression. I don't have strong opinions which is the best way to go. By 
default, with no demonstrated benefit, I'd err on the side of not changing 
things.

> Another motivation for eliminating kernel stub is to be able to emit a symbol 
> with the same mangled name as a kernel as a global variable instead of a 
> function. Since we need such symbols to be able to launch kernels with 
> mangled name in a C++ program. If we use kernel stub as the symbol, we cannot 
> use the original mangled kernel name since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? 
I.e. if it's something you *need* to do, then the discussion about minor 
performance gain becomes rather irrelevant and we should weigh 'improvements in 
HIP debugging' vs 'regression in host-only debugging' instead.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D86376: [HIP] Improve kernel launching latency

Reply via email to