[PATCH] D86376: [HIP] Improve kernel launching latency

Yaxun Liu via Phabricator via cfe-commits Tue, 25 Aug 2020 09:17:25 -0700

yaxunl added a comment.

In D86376#2234824 <https://reviews.llvm.org/D86376#2234824>, @tra wrote:

> In D86376#2234719 <https://reviews.llvm.org/D86376#2234719>, @yaxunl wrote:
>
>>> This patch appears to be somewhere in the gray area to me. My prior 
>>> experience with CUDA suggests that it will make little to no difference. On 
>>> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
>>> specific evidence, I still can't tell what's the case here.
>>
>> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 
>> 60 us. The typical kernel launching latency is about 500us, therefore the 
>> improvement is around 10%.
>
> 60 *micro seconds* to store/load something from memory? It does not sound 
> right. 0.5 millisecond per kernel launch is also suspiciously high. 
> For CUDA it's ~5us 
> (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). 
> If it does indeed take 60 microseconds to push/pop a O(cacheline) worth of 
> launch config data, the implementation may be doing something wrong. We're 
> talking about O(100) syscalls and that's way too much work for something that 
> simple. What do those calls do?
>
> Can you confirm that the units are indeed microseconds and not nanoseconds?

My previous measurements did not warming up, which caused some one time 
overhead due to device initialization and loading of device binary. With warm 
up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 19 
us. Based on the trace from rocprofile, the time spent inside these functions 
can be ignored. Most of the time is spent making the calls. These functions 
stay in a shared library, which may be the reason why they take such long time. 
Making them always_inline may get rid of the overhead, however, that would 
require exposing internal data structures.

The kernel launching latency are measured by a simple loop in which a simple 
kernel is launched then hipStreamSynchronize is called. trace is collected by 
rocprofiler and the latency is measured from the end of hipStreamSynchronize to 
the real start of kernel execution. Without this patch, the latency is about 77 
us. With this patch, the latency is about 46 us. The improvement is about 40%. 
The decrement of 31 us is more than 19 us since it also eliminates the overhead 
of kernel stub.

>> I would like to say the motivation of this change is two folds: 1. improve 
>> latency 2. interoperability with C++ programs.
>
> Could you elaborate on the "interoperability with C++ programs"? I don't 
> think I see how this patch helps with that. Or what exactly is the issue with 
> C++ interoperability we have now?

In HIP program, a global symbol is generated in host binary to identify each 
kernel. This symbol is associated with the device kernel by a call of 
hipRegisterFunction in init functions. Each time the kernel needs to be called, 
the associated symbol is passed to hipLaunchKernel. In host code, this symbol 
represents the kernel. Let's call it the kernel symbol. Currently it is the 
kernel stub function, however, it could be any global symbol, as long as it is 
registered with hipRegisterFunction, then hipLaunchKernel can use it to find 
the right kernel and launch it.

In a C/C++ program, a kernel is launched by call of hipLaunchKernel with the 
kernel symbol. Since the kernel symbol is defined in object files generated 
from HIP. For C/C++ program, as long as it declares the kernel symbol as an 
external function or variable which matches the name of the original symbol, 
the linker will resolve to the correct kernel symbol, then the correct kernel 
can be launched.

Here comes the nuance with kernel stub function as the kernel symbol. If you 
still remember, there was a previous patch for HIP to change the kernel stub 
name. rocgdb requires the device stub to have a different name than the real 
kernel, since otherwise it will not be able to break on the real kernel only. 
As a result, the kernel stub now has a prefix `__device_stub_` before mangling.

For example, a kernel `foo` will have a kernel stub with name 
`__device_stub_foo`.

For a C/C++ program to call kernel `foo`, it needs to declare an external 
symbol `__device_stub_foo` then launch it. Of course this is an annoyance for 
C/C++ users, especially this involves mangled names.

However, we cannot change the name of the kernel stub to be the same as the 
kernel, since that will break rocgdb.

Now the solution is to get rid of the kernel stub function. Instead of use 
kernel stub function as kernel symbol, we will emit a global variable as kernel 
symbol. This global variable can have the same name as the kernel, since rocgdb 
will not break on it.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D86376: [HIP] Improve kernel launching latency

Reply via email to