tra added a comment.

In D86376#2234719 <https://reviews.llvm.org/D86376#2234719>, @yaxunl wrote:

>> This patch appears to be somewhere in the gray area to me. My prior 
>> experience with CUDA suggests that it will make little to no difference. On 
>> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
>> specific evidence, I still can't tell what's the case here.
>
> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 
> us. The typical kernel launching latency is about 500us, therefore the 
> improvement is around 10%.

60 *microseconds* to store/load something from memory? That does not sound 
right. 0.5 milliseconds per kernel launch is also suspiciously high; 
for CUDA it's ~5us 
(https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If 
it does indeed take 60 microseconds to push/pop an O(cacheline) worth of launch 
config data, the implementation may be doing something wrong. We're talking 
about the cost of O(100) syscalls, and that's way too much work for something 
that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

> To run HIP applications, users need to install ROCm, which includes rocgdb.

I would disagree with that assertion. I may very much want to build a 
Tensorflow-based app, run it in a container with nothing in it but the app, and 
use the existing infrastructure to capture relevant info if the app crashes; 
such capture will not use any HIP-specific tools. 
Or I could hand the executable to a user who does not care at all what's inside 
it, but who may want to run it under gdb if something goes wrong.

> A debugger without device code debugging capability has little use with HIP 
> applications therefore I would expect users to always use rocgdb to debug HIP 
> program.

I agree that that's the case when someone wants or needs to debug GPU code. 
In many cases, however, it is sufficient to be able to debug host-side things 
only, and it is useful to see the point where we launch kernels and to be able 
to tell which kernel it was.

> Also, since clang already removed all debug information for kernel stub, gdb 
> cannot break on kernel stub any way.

gdb is aware of the ELF symbols, and those are often exposed in shared 
libraries. While you will not have type info, etc., you can still set a 
breakpoint and get a sensible stack trace in many cases. We usually build with 
some amount of debug info, and it did prove rather helpful for pinpointing GPU 
failures via a host-side stack trace, since that trace included the symbol name 
of the host-side stub, which identifies the device-side kernel. If all we see 
in the stack trace is `hipLaunchKernel`, it would be considerably less helpful, 
especially when there's no detailed debug info that would let us dig the kernel 
name out of its arguments. All we'd know is that we've launched *some* kernel.

>> Is eliminating the host-side stub the goal, or just a coincidental 
>> side-effect? I.e. if it's something you *need* to do, then the discussion 
>> about minor performance gain becomes rather irrelevant and we should weigh 
>> 'improvements in HIP debugging' vs 'regression in host-only debugging' 
>> instead.
>
> I would like to say the motivation of this change is two folds: 1. improve 
> latency 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think 
I see how this patch helps with that. Or what exactly is the issue with C++ 
interoperability we have now?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376
