[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-26 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D86376#2236704 , @tra wrote:

> It's still suspiciously high. AFAICT, config push/pop is just an std::vector 
> push/pop. It should not take *that* long. A few function calls should not add 
> up to microseconds of overhead once the linker has resolved the symbols, even 
> if they come from a shared library.
> https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590
>
> I wonder if it's the logging facilities that add all this overhead.

You are right. The 19 us is mostly overhead from rocprofiler. If I do not use 
rocprofiler and instead use a simple loop to measure the execution time of 
`__hipPushCallConfigure/__hipPopCallConfigure`, I get about 180 ns.
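For reference, the loop is essentially the sketch below. The extern 
declarations are my assumption about the runtime entry points (they mirror the 
calls clang emits around the kernel stub), so treat this as an illustration 
rather than a supported API:

  #include <chrono>
  #include <cstdio>
  #include <hip/hip_runtime.h>

  // Assumed internal entry points; clang emits calls of roughly this shape.
  extern "C" hipError_t __hipPushCallConfiguration(dim3 grids, dim3 blocks,
                                                   size_t shmem, hipStream_t stream);
  extern "C" hipError_t __hipPopCallConfiguration(dim3 *grids, dim3 *blocks,
                                                  size_t *shmem, hipStream_t *stream);

  int main() {
    dim3 grids(1), blocks(1);
    size_t shmem = 0;
    hipStream_t stream = nullptr;
    constexpr int N = 1000000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
      __hipPushCallConfiguration(grids, blocks, shmem, stream);
      __hipPopCallConfiguration(&grids, &blocks, &shmem, &stream);
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%f ns per push/pop pair\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / N);
  }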

>> The kernel launching latency is measured by a simple loop in which a simple 
>> kernel is launched and then hipStreamSynchronize is called. The trace is 
>> collected by rocprofiler and the latency is measured from the end of 
>> hipStreamSynchronize to the real start of kernel execution. Without this 
>> patch, the latency is about 77 us. With this patch, it is about 46 us, an 
>> improvement of about 40%. The reduction of 31 us is more than 19 us because 
>> it also eliminates the overhead of the kernel stub.
>
> This is rather surprising. A function call by itself does *not* have such 
> high overhead. There must be something else. I strongly suspect logging. If 
> you remove logging statements from push/pop without changing anything else, 
> how does that affect performance?

The 19 us overhead was due to rocprofiler. Without rocprofiler, I can only 
measure the average duration of a kernel launch together with 
hipStreamSynchronize. When the kernel is empty, this serves as an estimate of 
the kernel launching latency. Measured that way, the latency is about 14.0 us, 
so the improvement due to this patch is not significant.
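Concretely, the measurement is a loop like the following sketch (the kernel and 
variable names are made up for illustration):

  #include <chrono>
  #include <cstdio>
  #include <hip/hip_runtime.h>

  __global__ void empty_kernel() {}

  int main() {
    // Warm up: device initialization and code-object loading happen on the
    // first launch and must not be counted.
    empty_kernel<<<1, 1>>>();
    hipStreamSynchronize(nullptr);

    constexpr int N = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
      empty_kernel<<<1, 1>>>();
      hipStreamSynchronize(nullptr);
    }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%f us per launch + sync\n",
                std::chrono::duration<double, std::micro>(t1 - t0).count() / N);
  }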

>> In a C/C++ program, a kernel is launched by calling hipLaunchKernel with the 
>> kernel symbol.
>
> Do you mean the host-side symbol, registered with the runtime as you've 
> described above? Or do you mean that the device-side symbol is somehow 
> visible from the host side? I think that's where HIP is different from CUDA.

I mean the host-side symbol. A host program can only use a host-side symbol to 
launch a kernel.

> I do not follow your reasoning why the stub name is a problem. It's awkward, 
> yes, but losing the stub as a specific kernel entry point seems to be a real 
> loss in debuggability, which is worse, IMO.
> Could you give me an example where the stub name causes problems?

For example, in a HIP program there is a kernel `void foo(int*)`. If a C++ 
program wants to launch it, the desirable way is

  void foo(int*);
  hipLaunchKernel(foo, grids, blocks, args, shmem, stream);

Due to the prefixed kernel stub name, users currently have to use

  void __device_stub_foo(int*);
  hipLaunchKernel(__device_stub_foo, grids, blocks, args, shmem, stream);


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-25 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D86376#2236501 , @yaxunl wrote:

> My previous measurements did not include warm-up, which caused some one-time 
> overhead due to device initialization and loading of the device binary. With 
> warm-up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes 
> about 19 us. Based on the trace from rocprofiler, the time spent inside these 
> functions is negligible. Most of the time is spent making the calls. These 
> functions reside in a shared library, which may be why they take so long. 
> Making them always_inline may get rid of the overhead; however, that would 
> require exposing internal data structures.

It's still suspiciously high. AFAICT, config push/pop is just an std::vector 
push/pop. It should not take *that* long. A few function calls should not add 
up to microseconds of overhead once the linker has resolved the symbols, even 
if they come from a shared library.
https://github.com/ROCm-Developer-Tools/HIP/blob/master/vdi/hip_platform.cpp#L590

I wonder if it's the logging facilities that add all this overhead.

> The kernel launching latency is measured by a simple loop in which a simple 
> kernel is launched and then hipStreamSynchronize is called. The trace is 
> collected by rocprofiler and the latency is measured from the end of 
> hipStreamSynchronize to the real start of kernel execution. Without this 
> patch, the latency is about 77 us. With this patch, it is about 46 us, an 
> improvement of about 40%. The reduction of 31 us is more than 19 us because 
> it also eliminates the overhead of the kernel stub.

This is rather surprising. A function call by itself does *not* have such high 
overhead. There must be something else. I strongly suspect logging. If you 
remove logging statements from push/pop without changing anything else, how 
does that affect performance?

>>> I would like to say the motivation of this change is twofold: 1. improve 
>>> latency; 2. interoperability with C++ programs.
>>
>> Could you elaborate on the "interoperability with C++ programs"? I don't 
>> think I see how this patch helps with that. Or what exactly is the issue 
>> with C++ interoperability we have now?
>
> In a HIP program, a global symbol is generated in the host binary to identify 
> each kernel. This symbol is associated with the device kernel by a call to 
> hipRegisterFunction in the init functions. Each time the kernel needs to be 
> called, the associated symbol is passed to hipLaunchKernel. In host code, 
> this symbol represents the kernel; let's call it the kernel symbol. Currently 
> it is the kernel stub function, but it could be any global symbol: as long as 
> it is registered with hipRegisterFunction, hipLaunchKernel can use it to find 
> the right kernel and launch it.

So far so good, it matches the way CUDA does that.

> In a C/C++ program, a kernel is launched by calling hipLaunchKernel with the 
> kernel symbol.

Do you mean the host-side symbol, registered with the runtime as you've 
described above? Or do you mean that the device-side symbol is somehow visible 
from the host side? I think that's where HIP is different from CUDA.

> Since the kernel symbol is defined in object files generated from HIP. 
> For C/C++ program, as long as it declares the kernel symbol as an external 
> function or variable which matches the name of the original symbol, the 
> linker will resolve to the correct kernel symbol, then the correct kernel can 
> be launched.

The first sentence looks incomplete. It seems to imply that hipLaunchKernel 
uses the device-side kernel symbol and that it's the linker which ties the 
host-side reference to the device-side symbol. If that's the case, then I don't 
understand what purpose is served by hipRegisterFunction. AFAICT, it's not used 
in this scenario at all.

My mental model of kernel launch mechanics looks like this:

- For a kernel foo, there is a host-side symbol (it's the stub for CUDA) with 
the name 'foo' and device-side real kernel 'foo'.
- host side linker has no access to device-side symbols, but we do need to 
associate host and device side 'foo' instances.
- address of host-side foo is registered with runtime to map it to device 
symbol with the name 'foo'
- when a kernel is launched, call site sets up launch config and calls the 
stub, passing it the kernel arguments.
- the stub calls the kernel launch function, and passes host-side foo address 
to the kernel launch function
- launch function finds the device-side symbol name via the registration info 
and does a device-side address lookup to obtain its device address
- run device-side function.

In this scenario, the host-side stub for foo is a regular function, which gdb 
can stop on and examine kernel arguments.
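As a host-only toy model of that flow (toyRegisterFunction/toyLaunchKernel are 
stand-ins I'm making up for the runtime's registration and launch entry points, 
not the real API):

  #include <cstdio>
  #include <map>
  #include <string>

  // The host-side handle is just an address; registration maps it to a
  // device-side symbol name, and the launch function looks that name up.
  static std::map<const void *, std::string> registered;

  void toyRegisterFunction(const void *handle, const char *deviceName) {
    registered[handle] = deviceName;  // done from an init function in real code
  }

  void toyLaunchKernel(const void *handle /*, grid, block, args, shmem, stream */) {
    // The handle is never called; it is only a key for the lookup.
    std::printf("launching device kernel '%s'\n", registered[handle].c_str());
  }

  // Host-side stub for kernel 'foo': the call site sets up the launch config,
  // then calls the stub with the kernel arguments; the stub forwards its own
  // address as the handle.
  void device_stub_foo(int * /*p*/) {
    toyLaunchKernel(reinterpret_cast<const void *>(&device_stub_foo));
  }

  int main() {
    toyRegisterFunction(reinterpret_cast<const void *>(&device_stub_foo), "foo");
    int x = 0;
    device_stub_foo(&x);  // prints: launching device kernel 'foo'
  }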

How is the process different for HIP? I know that we've changed the stub name 
to avoid debugger confusion about which of the entities corresponds to 'foo'.

> Here comes the nuance with kernel stub function as the kernel 

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-25 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D86376#2234824 , @tra wrote:

> In D86376#2234719 , @yaxunl wrote:
>
>>> This patch appears to be somewhere in the gray area to me. My prior 
>>> experience with CUDA suggests that it will make little to no difference. On 
>>> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
>>> specific evidence, I still can't tell what's the case here.
>>
>> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 
>> 60 us. The typical kernel launching latency is about 500us, therefore the 
>> improvement is around 10%.
>
> 60 *micro seconds* to store/load something from memory? It does not sound 
> right. 0.5 millisecond per kernel launch is also suspiciously high. 
> For CUDA it's ~5us 
> (https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). 
> If it does indeed take 60 microseconds to push/pop an O(cacheline) worth of 
> launch config data, the implementation may be doing something wrong. We're 
> talking about O(100) syscalls worth of time, and that's way too much work for 
> something that simple. What do those calls do?
>
> Can you confirm that the units are indeed microseconds and not nanoseconds?

My previous measurements did not include warm-up, which caused some one-time 
overhead due to device initialization and loading of the device binary. With 
warm-up, the call of `__hipPushCallConfigure/__hipPopCallConfigure` takes about 
19 us. Based on the trace from rocprofiler, the time spent inside these 
functions is negligible. Most of the time is spent making the calls. These 
functions reside in a shared library, which may be why they take so long. 
Making them always_inline may get rid of the overhead; however, that would 
require exposing internal data structures.

The kernel launching latency is measured by a simple loop in which a simple 
kernel is launched and then hipStreamSynchronize is called. The trace is 
collected by rocprofiler and the latency is measured from the end of 
hipStreamSynchronize to the real start of kernel execution. Without this patch, 
the latency is about 77 us. With this patch, it is about 46 us, an improvement 
of about 40%. The reduction of 31 us is more than 19 us because it also 
eliminates the overhead of the kernel stub.

>> I would like to say the motivation of this change is twofold: 1. improve 
>> latency; 2. interoperability with C++ programs.
>
> Could you elaborate on the "interoperability with C++ programs"? I don't 
> think I see how this patch helps with that. Or what exactly is the issue with 
> C++ interoperability we have now?

In a HIP program, a global symbol is generated in the host binary to identify 
each kernel. This symbol is associated with the device kernel by a call to 
hipRegisterFunction in the init functions. Each time the kernel needs to be 
called, the associated symbol is passed to hipLaunchKernel. In host code, this 
symbol represents the kernel; let's call it the kernel symbol. Currently it is 
the kernel stub function, but it could be any global symbol: as long as it is 
registered with hipRegisterFunction, hipLaunchKernel can use it to find the 
right kernel and launch it.

In a C/C++ program, a kernel is launched by calling hipLaunchKernel with the 
kernel symbol. Since the kernel symbol is defined in object files generated 
from HIP. For C/C++ program, as long as it declares the kernel symbol as an 
external function or variable which matches the name of the original symbol, 
the linker will resolve to the correct kernel symbol, then the correct kernel 
can be launched.

Here comes the nuance with the kernel stub function as the kernel symbol. If 
you remember, there was a previous patch for HIP that changed the kernel stub 
name. rocgdb requires the device stub to have a different name than the real 
kernel, since otherwise it will not be able to break on the real kernel only. 
As a result, the kernel stub now gets the prefix `__device_stub_` before 
mangling.

For example, a kernel `foo` will have a kernel stub with name 
`__device_stub_foo`.

For a C/C++ program to call kernel `foo`, it needs to declare an external 
symbol `__device_stub_foo` and then launch it. Of course this is an annoyance 
for C/C++ users, especially since it involves mangled names.

However, we cannot change the name of the kernel stub to be the same as the 
kernel, since that will break rocgdb.

Now the solution is to get rid of the kernel stub function. Instead of using 
the kernel stub function as the kernel symbol, we will emit a global variable 
as the kernel symbol. This global variable can have the same name as the 
kernel, since rocgdb will not break on it.
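With that change, the C/C++ caller can look like the following sketch (my 
assumption being that the emitted variable keeps the kernel's mangled name, so 
a declaration with the kernel's signature is enough for the linker to resolve 
it):

  #include <hip/hip_runtime.h>

  // Declaring 'foo' with the kernel's signature yields the same mangled name,
  // so the linker binds it to the kernel symbol emitted by the HIP TU.
  void foo(int *);

  void launch_foo(int *p, dim3 grids, dim3 blocks, hipStream_t stream) {
    void *args[] = {&p};
    hipLaunchKernel(reinterpret_cast<const void *>(&foo), grids, blocks, args,
                    /*sharedMemBytes=*/0, stream);
  }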


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D86376#2234719 , @yaxunl wrote:

>> This patch appears to be somewhere in the gray area to me. My prior 
>> experience with CUDA suggests that it will make little to no difference. On 
>> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
>> specific evidence, I still can't tell what's the case here.
>
> Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 
> us. The typical kernel launching latency is about 500us, therefore the 
> improvement is around 10%.

60 *micro seconds* to store/load something from memory? It does not sound 
right. 0.5 millisecond per kernel launch is also suspiciously high. 
For CUDA it's ~5us 
(https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf). If 
it does indeed take 60 microseconds to push/pop an O(cacheline) worth of launch 
config data, the implementation may be doing something wrong. We're talking 
about O(100) syscalls worth of time, and that's way too much work for something 
that simple. What do those calls do?

Can you confirm that the units are indeed microseconds and not nanoseconds?

> To run HIP applications, users need to install ROCm, which includes rocgdb.

I would disagree with that assertion. I do very much want to build a 
Tensorflow-based app and run it in a container with nothing else but the app 
and I do want to use existing infrastructure to capture relevant info if the 
app crashes. Such capture will not be using any HIP-specific tools. 
Or I could give it to a user who absolutely does not care what's inside the 
executable, but who may want to run it under gdb if something goes wrong.

> A debugger without device-code debugging capability is of little use with HIP 
> applications; therefore, I would expect users to always use rocgdb to debug 
> HIP programs.

I agree that it's indeed the case if someone wants/needs to debug GPU code, 
however, in many cases it's sufficient to be able to debug host-side things 
only. And it is useful to see the point where we launch kernels and be able to 
tell which kernel it was.

> Also, since clang already removed all debug information for the kernel stub, 
> gdb cannot break on the kernel stub anyway.

gdb is aware of the ELF symbols and those are often exposed in shared 
libraries. While you will not have type info, etc, you can still set a 
breakpoint and get a sensible stack trace in many cases. We usually build with 
some amount of debug info and it did prove rather helpful to pin-point GPU 
failures via host-side stack trace as it did include the symbol name of the 
host-side stub which allows identifying the device-side kernel. If all we see 
in the stack trace is `hipLaunchKernel`, it would be considerably less helpful, 
especially when there's no detailed debug info which would allow us to dig out 
the kernel name from its arguments. All we'd know is that we've launched *some* 
kernel.

>> Is eliminating the host-side stub the goal, or just a coincidental 
>> side-effect? I.e. if it's something you *need* to do, then the discussion 
>> about minor performance gain becomes rather irrelevant and we should weigh 
>> 'improvements in HIP debugging' vs 'regression in host-only debugging' 
>> instead.
>
> I would like to say the motivation of this change is twofold: 1. improve 
> latency; 2. interoperability with C++ programs.

Could you elaborate on the "interoperability with C++ programs"? I don't think 
I see how this patch helps with that. Or what exactly is the issue with C++ 
interoperability we have now?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.



In D86376#2234547 , @tra wrote:

> I'm OK with how the patch is implemented.
> I'm still on the fence regarding whether it should be implemented.
>
> In D86376#2234458 , @yaxunl wrote:
>
>> `__hipPushConfiguration/__hipPopConfiguration` and the kernel stub can cause 
>> 40 ns of overhead, whereas we have requests to squeeze out any overhead in 
>> the kernel launching latency.
>
> That's about the same as 1 cache miss. I'm willing to bet that it will be 
> lost in the noise. Are there any real world benchmarks where it makes a 
> difference?
> Are those requests driven by a specific use case? Not all requests (even well 
> intentioned ones) are worth implementing. 
> This patch appears to be somewhere in the gray area to me. My prior 
> experience with CUDA suggests that it will make little to no difference. On 
> the other hand, AMD GPUs may be different enough to prove me wrong. Without 
> specific evidence, I still can't tell what's the case here.

Sorry, the overhead due to `__hipPushConfigure/__hipPopConfigure` is about 60 
us. The typical kernel launching latency is about 500us, therefore the 
improvement is around 10%.

>>> One side effect of this patch is that there will be no convenient way to 
>>> set a host-side breakpoint on kernel launch.
>>> Another will be that examining the call stack will become somewhat confusing 
>>> as the arguments passed to the kernel as written in the source code will 
>>> not match those observed in the stack trace. I guess preserving the 
>>> appearance of normal function calls was the reason for the split config 
>>> setup/kernel launch in CUDA. I'd say it's still useful to have, as a 
>>> CUDA-specific debugger is not always available and one must use regular gdb 
>>> on CUDA apps now and then.
>>
>> Eliminating the kernel stub does not negatively affect debuggability. At 
>> least this is true for the HIP debugger. Actually, our debugger team 
>> intentionally requested that we eliminate any debug information for the 
>> kernel stub so that it will not confuse the debugger with the real kernel. 
>> This is because the kernel stub is an artificial function for launching the 
>> kernel, not the real kernel, which is in the device binary. For the HIP 
>> debugger (rocgdb), when the user sets a breakpoint on a kernel, it breaks on 
>> the real kernel in the device binary, and the call stack is displayed 
>> correctly. The arguments to the real kernel are not lost, since the real 
>> kernel is a real function in the device binary.
>
> You appear to assume debuggability with a HIP-aware debugger. That part I'm 
> not particularly concerned about, as I assume that it will be tested on AMD's 
> side.
> I was mostly concerned about debuggability with the ordinary gdb. Imagine 
> someone having to debug a TF app they've got somewhere. The end user may not 
> even have HIP tools installed. It would be useful to be able to debug until 
> the point where control is passed to the GPU. The patch will likely have a 
> minor, but still negative impact on that.
>
> I guess one should still be able to set a breakpoint using the `file:line 
> number`. If you could verify that it still works with gdb, that would be a 
> reasonable workaround. I think we still need to have some way to set a 
> breakpoint on the kernel launch site (I think it should still work) and on 
> the kernel entry.

To run HIP applications, users need to install ROCm, which includes rocgdb. A 
debugger without device-code debugging capability is of little use with HIP 
applications; therefore, I would expect users to always use rocgdb to debug HIP 
programs. Also, since clang already removed all debug information for the 
kernel stub, gdb cannot break on the kernel stub anyway.

>> Another motivation for eliminating the kernel stub is to be able to emit a 
>> symbol with the same mangled name as the kernel as a global variable instead 
>> of a function. We need such symbols to be able to launch kernels by mangled 
>> name from a C++ program. If we use the kernel stub as the symbol, we cannot 
>> use the original mangled kernel name, since our debugger does not allow 
>> that.
>
> Is eliminating the host-side stub the goal, or just a coincidental 
> side-effect? I.e. if it's something you *need* to do, then the discussion 
> about minor performance gain becomes rather irrelevant and we should weigh 
> 'improvements in HIP debugging' vs 'regression in host-only debugging' 
> instead.

I would like to say the motivation of this change is twofold: 1. improve 
latency; 2. interoperability with C++ programs.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

I'm OK with how the patch is implemented.
I'm still on the fence regarding whether it should be implemented.

In D86376#2234458 , @yaxunl wrote:

> `__hipPushConfiguration/__hipPopConfiguration` and the kernel stub can cause 
> 40 ns of overhead, whereas we have requests to squeeze out any overhead in 
> the kernel launching latency.

That's about the same as 1 cache miss. I'm willing to bet that it will be lost 
in the noise. Are there any real world benchmarks where it makes a difference?
Are those requests driven by a specific use case? Not all requests (even well 
intentioned ones) are worth implementing. 
This patch appears to be somewhere in the gray area to me. My prior experience 
with CUDA suggests that it will make little to no difference. On the other 
hand, AMD GPUs may be different enough to prove me wrong. Without specific 
evidence, I still can't tell what's the case here.

>> One side effect of this patch is that there will be no convenient way to set 
>> a host-side breakpoint on kernel launch.
>> Another will be that examining the call stack will become somewhat confusing 
>> as the arguments passed to the kernel as written in the source code will not 
>> match those observed in the stack trace. I guess preserving the appearance 
>> of normal function calls was the reason for the split config setup/kernel 
>> launch in CUDA. I'd say it's still useful to have, as a CUDA-specific 
>> debugger is not always available and one must use regular gdb on CUDA apps 
>> now and then.
>
> Eliminating the kernel stub does not negatively affect debuggability. At 
> least this is true for the HIP debugger. Actually, our debugger team 
> intentionally requested that we eliminate any debug information for the 
> kernel stub so that it will not confuse the debugger with the real kernel. 
> This is because the kernel stub is an artificial function for launching the 
> kernel, not the real kernel, which is in the device binary. For the HIP 
> debugger (rocgdb), when the user sets a breakpoint on a kernel, it breaks on 
> the real kernel in the device binary, and the call stack is displayed 
> correctly. The arguments to the real kernel are not lost, since the real 
> kernel is a real function in the device binary.

You appear to assume debuggability with a HIP-aware debugger. That part I'm not 
particularly concerned about, as I assume that it will be tested on AMD's side.
I was mostly concerned about debuggability with the ordinary gdb. Imagine 
someone having to debug a TF app they've got somewhere. The end user may not 
even have HIP tools installed. It would be useful to be able to debug until the 
point where control is passed to the GPU. The patch will likely have a minor, 
but still negative impact on that.

I guess one should still be able to set a breakpoint using the `file:line 
number`. If you could verify that it still works with gdb, that would be a 
reasonable workaround. I think we still need to have some way to set a 
breakpoint on the kernel launch site (I think it should still work) and on the 
kernel entry.
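For example, something like this should keep working in plain gdb (file name 
and line number are made up):

  (gdb) break app.hip:42    # the line containing the <<<...>>> launch
  (gdb) run
  (gdb) bt                  # host-side stack at the launch site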

So, we have a trade-off of minor performance gain vs a minor debuggability 
regression. I don't have strong opinions which is the best way to go. By 
default, with no demonstrated benefit, I'd err on the side of not changing 
things.

> Another motivation for eliminating the kernel stub is to be able to emit a 
> symbol with the same mangled name as the kernel as a global variable instead 
> of a function. We need such symbols to be able to launch kernels by mangled 
> name from a C++ program. If we use the kernel stub as the symbol, we cannot 
> use the original mangled kernel name, since our debugger does not allow that.

Is eliminating the host-side stub the goal, or just a coincidental side-effect? 
I.e. if it's something you *need* to do, then the discussion about minor 
performance gain becomes rather irrelevant and we should weigh 'improvements in 
HIP debugging' vs 'regression in host-only debugging' instead.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D86376#2234259 , @tra wrote:

> How much does this inlining buy you in practice? I.e. what's a typical launch 
> latency before/after the patch? For CUDA, config push/pop is negligible 
> compared to the cost of actually launching the kernel on the GPU. It is 
> measurable if the launch is asynchronous, but queueing kernels fast does not 
> help all that much in the long run -- you eventually have to run those 
> kernels on the GPU, so in most cases you'll just spend a bit more time idling 
> while waiting for the queued kernels to finish. To be beneficial, you'll need 
> a finely balanced CPU/GPU workload and that's rather hard to achieve. Not to 
> the point where the minor savings here would be meaningful. I would assume 
> the situation on AMD GPUs is not that different.

`__hipPushConfiguration/__hipPopConfiguration` and the kernel stub can cause 
40 ns of overhead, whereas we have requests to squeeze out any overhead in the 
kernel launching latency.

> One side effect of this patch is that there will be no convenient way to set 
> a host-side breakpoint on kernel launch.
> Another will be that examining the call stack will become somewhat confusing 
> as the arguments passed to the kernel as written in the source code will not 
> match those observed in the stack trace. I guess preserving the appearance of 
> normal function calls was the reason for the split config setup/kernel launch 
> in CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is 
> not always available and one must use regular gdb on CUDA apps now and then.

Eliminating the kernel stub does not negatively affect debuggability. At least 
this is true for the HIP debugger. Actually, our debugger team intentionally 
requested that we eliminate any debug information for the kernel stub so that 
it will not confuse the debugger with the real kernel. This is because the 
kernel stub is an artificial function for launching the kernel, not the real 
kernel, which is in the device binary. For the HIP debugger (rocgdb), when the 
user sets a breakpoint on a kernel, it breaks on the real kernel in the device 
binary, and the call stack is displayed correctly. The arguments to the real 
kernel are not lost, since the real kernel is a real function in the device 
binary.

Another motivation for eliminating the kernel stub is to be able to emit a 
symbol with the same mangled name as the kernel as a global variable instead of 
a function. We need such symbols to be able to launch kernels by mangled name 
from a C++ program. If we use the kernel stub as the symbol, we cannot use the 
original mangled kernel name, since our debugger does not allow that.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

How much does this inlining buy you in practice? I.e. what's a typical launch 
latency before/after the patch? For CUDA, config push/pop is negligible 
compared to the cost of actually launching the kernel on the GPU. It is 
measurable if the launch is asynchronous, but queueing kernels fast does not 
help all that much in the long run -- you eventually have to run those kernels 
on the GPU, so in most cases you'll just spend a bit more time idling while 
waiting for the queued kernels to finish. To be beneficial, you'll need a 
finely balanced CPU/GPU workload and that's rather hard to achieve. Not to the 
point where the minor savings here would be meaningful. I would assume the 
situation on AMD GPUs is not that different.

One side effect of this patch is that there will be no convenient way to set a 
host-side breakpoint on kernel launch.
Another will be that examining the call stack will become somewhat confusing as 
the arguments passed to the kernel as written in the source code will not match 
those observed in the stack trace. I guess preserving the appearance of normal 
function calls was the reason for the split config setup/kernel launch in 
CUDA. I'd say it's still useful to have, as a CUDA-specific debugger is not 
always available and one must use regular gdb on CUDA apps now and then.

If the patch does give a measurable performance improvement, can we implement 
the launch config push/pop in a way that the compiler can eliminate it by 
itself when possible, and keep the stub as the host-side kernel entry point? I 
would prefer to avoid sacrificing debugging usability for performance 
optimizations that may not matter.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86376/new/

https://reviews.llvm.org/D86376

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-21 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl created this revision.
yaxunl added reviewers: tra, rjmccall.
yaxunl requested review of this revision.

Currently clang emits the following code for a triple-chevron kernel call for 
HIP:

  __hipPushCallConfiguration(grids, blocks, shmem, stream);
  kernel_stub();

and for each kernel, clang emits a kernel_stub:

  void kernel_stub() {
__hipPopCallConfiguration(&grids, &blocks, &shmem, &stream);
hipLaunchKernel(kernel_stub, grids, blocks, kernel_args, shmem, stream);
  }

This is really unnecessary. In host code, a kernel function is not really a 
"function", since you cannot "call" it in the generated IR; you can only launch 
it through the kernel launching API.

This patch simplifies the generated code for kernel launching by eliminating 
the calls to `__hipPushCallConfiguration` and `__hipPopCallConfiguration`. For 
each triple chevron, a call to `hipLaunchKernel` is emitted directly. The 
kernel stub function is still emitted, as an empty function, for the sole 
purpose of serving as a shadow symbol that maps to the device symbol in the 
device binary, so that the runtime can use it to find the device symbol.
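Conceptually, the code at the launch site then looks like the sketch below (an 
illustration of the intent, not the exact IR clang generates):

  // shadow symbol registered with the runtime (emitted as an empty function)
  void kernel_stub();

  // for kernel<<<grids, blocks, shmem, stream>>>(x), roughly:
  void *kernel_args[] = { &x };
  hipLaunchKernel((void *)kernel_stub, grids, blocks, kernel_args, shmem, stream);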

This patch does not change the AST for kernel calls, since semantically a 
triple chevron is like a function call. Keeping it as a function call 
facilitates overload resolution and function argument type checking.

This patch only changes the kernel launching codegen for HIP with the new 
kernel launching API, since we are sure there is no other side effect in 
`__hipPushCallConfiguration` and `__hipPopCallConfiguration`.


https://reviews.llvm.org/D86376

Files:
  clang/lib/CodeGen/CGCUDANV.cpp
  clang/test/CodeGenCUDA/kernel-call.cu
  clang/test/CodeGenCUDA/kernel-call.hip
  clang/test/lit.cfg.py

Index: clang/test/lit.cfg.py
===
--- clang/test/lit.cfg.py
+++ clang/test/lit.cfg.py
@@ -25,7 +25,7 @@
 config.test_format = lit.formats.ShTest(not llvm_config.use_lit_shell)
 
 # suffixes: A list of file extensions to treat as test files.
-config.suffixes = ['.c', '.cpp', '.i', '.cppm', '.m', '.mm', '.cu',
+config.suffixes = ['.c', '.cpp', '.i', '.cppm', '.m', '.mm', '.cu', '.hip',
'.ll', '.cl', '.s', '.S', '.modulemap', '.test', '.rs', '.ifs']
 
 # excludes: A list of directories to exclude from the testsuite. The 'Inputs'
Index: clang/test/CodeGenCUDA/kernel-call.hip
===
--- /dev/null
+++ clang/test/CodeGenCUDA/kernel-call.hip
@@ -0,0 +1,69 @@
+// RUN: %clang_cc1 -fhip-new-launch-api -triple x86_64-unknown-linux-gnu \
+// RUN:   -std=c++11 -emit-llvm %s -o - | FileCheck %s
+
+#include "Inputs/cuda.h"
+
+struct A { int a[10]; };
+
+__global__ void g1(int x) {}
+__global__ void g2(A x) {}
+__global__ void g3(A &x) {}
+template <class F> __global__ void g4(F f, int *x) { *x = f(); }
+void (*pg1)(int x) = g1;
+
+// CHECK-LABEL: define{{.*}}test1
+void test1() {
+  // CHECK: call void @_ZN4dim3C1Ejjj(%struct.dim3* {{.*}}, i32 2, i32 1, i32 1)
+  // CHECK: call void @_ZN4dim3C1Ejjj(%struct.dim3* {{.*}}, i32 3, i32 1, i32 1)
+  // CHECK: call i32 @hipLaunchKernel({{.*}}@_Z17__device_stub__g1i{{.*}}, i64 0, %struct.hipStream* null)
+  g1<<<2, 3>>>(0);
+
+  // CHECK: call void @_ZN4dim3C1Ejjj(%struct.dim3* {{.*}}, i32 4, i32 5, i32 6)
+  // CHECK: call void @_ZN4dim3C1Ejjj(%struct.dim3* {{.*}}, i32 7, i32 8, i32 9)
+  // CHECK: call i32 @hipLaunchKernel({{.*}}@_Z17__device_stub__g1i{{.*}}, i64 10, {{.*}}inttoptr (i64 11
+  g1<<<dim3(4, 5, 6), dim3(7, 8, 9), 10, (hipStream_t)11>>>(0);
+
+  // CHECK: %[[LD:.*]] = load void (i32)*, void (i32)** @pg1
+  // CHECK: %[[PTR:.*]] = bitcast void (i32)* %[[LD]] to i8*
+  // CHECK: call i32 @hipLaunchKernel({{.*}}%[[PTR]]{{.*}}, i64 0, %struct.hipStream* null)
+  pg1<<<1, 1>>>(0);
+}
+
+// CHECK-LABEL: define{{.*}}test2
+void test2() {
+  A a;
+  // CHECK: %agg.tmp = alloca %struct.A, align 4
+  // CHECK: %kernel_args = alloca i8*, i64 1, align 16
+  // CHECK: %[[CAST:.*]] = bitcast %struct.A* %agg.tmp to i8*
+  // CHECK: %[[GEP:.*]] = getelementptr i8*, i8** %kernel_args, i32 0
+  // CHECK: store i8* %[[CAST]], i8** %[[GEP]], align 8
+  // CHECK: call i32 @hipLaunchKernel({{.*}}@_Z17__device_stub__g21A{{.*}}, i64 0, %struct.hipStream* null)
+  g2<<<1, 1>>>(a);
+}
+
+// CHECK-LABEL: define{{.*}}test3
+void test3() {
+  A a;
+  // CHECK: %a = alloca %struct.A, align 4
+  // CHECK: %kernel_arg = alloca %struct.A*, align 8
+  // CHECK: %kernel_args = alloca i8*, i64 1, align 16
+  // CHECK: store %struct.A* %a, %struct.A** %kernel_arg, align 8
+  // CHECK: %[[CAST:.*]] = bitcast %struct.A** %kernel_arg to i8*
+  // CHECK: %[[GEP:.*]] = getelementptr i8*, i8** %kernel_args, i32 0
+  // CHECK: store i8* %[[CAST]], i8** %[[GEP]], align 8
+  // CHECK: call i32 @hipLaunchKernel({{.*}}@_Z17__device_stub__g3R1A{{.*}}, i64 0, %struct.hipStream* null)
+  g3<<<1, 1>>>(a);
+}
+
+// CHECK-LABEL: define{{.*}}test4
+void test4() {
+  int x = 123;
+  int y;
+  // CHECK: %agg.tmp = alloca %class.anon, align 4
+  // CHECK: %k