[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl marked an inline comment as done.
yaxunl added a comment.

In D101630#2748513 , @tra wrote:

> How about this:
> If the user explicitly specified `--cuda-host-only` or `--cuda-device-only`, 
> then by default only allow producing the natural output format, unless a 
> bundled output is requested by an option. This should keep existing users 
> working.
> If the compilation is done without explicitly requested sub-compilation(s), 
> then bundle the output by default. This should keep the GPU-unaware tools 
> like ccache happy as they would always get the single output they expect.
>
> WDYT?

`--cuda-host-only` always has one output, therefore there is no point in 
bundling its output. We only need to decide the proper behavior of 
`--cuda-device-only`.

How about keeping the original default behavior of not bundling when users do 
not specify an output file, and bundling the output when users do specify one? 
Specifying an output file indicates that the user requests a single output. 
`-f[no-]hip-bundle-device-output` would override the default behavior.
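
As an illustration of this proposal (hypothetical command lines; arch and file 
names are only for illustration):

  # No -o specified: one output file per GPU arch, not bundled.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -c a.hip

  # -o specified: the per-arch outputs are bundled into the one file.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -c a.hip -o a.o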


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101630/new/

https://reviews.llvm.org/D101630

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl updated this revision to Diff 347400.
yaxunl added a comment.

Fixed the option. Bundle the output if users specify the output via -o or -E.



Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/clang-offload-bundler.c
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-output-file-name.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip
  clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp

Index: clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
===
--- clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
+++ clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
@@ -117,6 +117,9 @@
 /// The index of the host input in the list of inputs.
 static unsigned HostInputIndex = ~0u;
 
+/// Whether not having host target is allowed.
+static bool AllowNoHost = false;
+
 /// Path to the current binary.
 static std::string BundlerExecutable;
 
@@ -839,9 +842,10 @@
   }
 
   // Get the file handler. We use the host buffer as reference.
-  assert(HostInputIndex != ~0u && "Host input index undefined??");
+  assert((HostInputIndex != ~0u || AllowNoHost) &&
+ "Host input index undefined??");
  Expected<std::unique_ptr<FileHandler>> FileHandlerOrErr =
-  CreateFileHandler(*InputBuffers[HostInputIndex]);
+  CreateFileHandler(*InputBuffers[AllowNoHost ? 0 : HostInputIndex]);
   if (!FileHandlerOrErr)
 return FileHandlerOrErr.takeError();
 
@@ -1108,6 +1112,7 @@
   // have exactly one host target.
   unsigned Index = 0u;
   unsigned HostTargetNum = 0u;
+  bool HIPOnly = true;
  llvm::DenseSet<StringRef> ParsedTargets;
   for (StringRef Target : TargetNames) {
 if (ParsedTargets.contains(Target)) {
@@ -1149,12 +1154,21 @@
   HostInputIndex = Index;
 }
 
+if (Kind != "hip" && Kind != "hipv4")
+  HIPOnly = false;
+
 ++Index;
   }
 
+  // HIP uses clang-offload-bundler to bundle device-only compilation results
+  // for multiple GPU archs, therefore allow no host target if all entries
+  // are for HIP.
+  AllowNoHost = HIPOnly;
+
   // Host triple is not really needed for unbundling operation, so do not
   // treat missing host triple as error if we do unbundling.
-  if ((Unbundle && HostTargetNum > 1) || (!Unbundle && HostTargetNum != 1)) {
+  if ((Unbundle && HostTargetNum > 1) ||
+  (!Unbundle && HostTargetNum != 1 && !AllowNoHost)) {
 reportError(createStringError(errc::invalid_argument,
   "expecting exactly one host target but got " +
   Twine(HostTargetNum)));
Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -6,7 +6,7 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -c -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip -fhip-bundle-device-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
 // With `-emit-llvm`, the output should be the same as the aforementioned line
@@ -16,14 +16,14 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -c -emit-llvm -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip -fhip-bundle-device-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITBC %s
 
 // RUN: %clang -### -target x86_64-linux-gnu \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip -fhip-bundle-device-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITLL %s
 
 // With `-emit-llvm`, the output should be the same as the aforementioned line
@@ -33,7 +33,7 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -emit-llvm -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
+// RUN:   %S/Inputs/hip_multiple_inputs/b.hip -fhip-bundle-device-output \
 // RUN: 2>&1 | FileCheck -check-prefixes=COMMON,EMITLL %s
 
 // With `-save-temps`, command lines for each step are dumped. For assembly
@@ -44,9 +44,17 @@
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   -S -nogpuinc -nogpulib --cuda-device-only -fgpu-rdc \
 // RUN:   %S/Inputs/hip_multiple_inputs/a.cu \
-// RUN:   %S/Inputs/hip_mul

[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: echristo.
tra added a comment.

In D101630#2777346 , @yaxunl wrote:

> In D101630#2748513 , @tra wrote:
>
>> How about this:
>> If the user explicitly specified `--cuda-host-only` or `--cuda-device-only`, 
>> then by default only allow producing the natural output format, unless a 
>> bundled output is requested by an option. This should keep existing users 
>> working.
>> If the compilation is done without explicitly requested sub-compilation(s), 
>> then bundle the output by default. This should keep the GPU-unaware tools 
>> like ccache happy as they would always get the single output they expect.
>>
>> WDYT?
>
> `--cuda-host-only` always has one output, therefore there is no point in 
> bundling its output. We only need to decide the proper behavior of 
> `--cuda-device-only`.

It still fits my proposal of requiring a single sub-compilation and not 
bundling the output.
The point was that such behavior is consistent regardless of whether we're 
compiling CUDA or HIP for the host or for device.

> How about keeping the original default behavior of not bundling when users do 
> not specify an output file, and bundling the output when users do specify one?

I think it will make things worse. Compiler output should not change depending 
on whether `-o` is used.

> Specifying an output file indicates that the user requests a single output. 
> -f[no-]hip-bundle-device-output would override the default behavior.

I disagree. When user specifies the output, the intent is to specify the 
**location** of the outputs, not its contents or format.

Telling compiler to produce a different output format should not depend on 
specifying (or not) the output location.

I think our options are:

- Always bundle --cuda-device-only outputs by default. This is consistent for 
HIP compilation, but deviates from CUDA, which can't do bundling. Also, 
single-target subcompilation deviates from both CUDA and regular C++ 
compilation, which is what most users would be familiar with and which would 
probably be the most sensible default for a single sub-compilation. It can be 
overridden with an option, but it goes against the principle that it's the 
specialized use case that should need extra options. The most common use case 
should not need them.

- Only bundle multiple sub-compilations' output by default. This would preserve 
the sensible single sub-compilation behavior. The downside is that it makes the 
output format depend on whether compiler ends up doing one or many 
sub-compilations. E.g. `--offload-arch=A -S` would produce ASM and 
`--offload-arch=A --offload-arch=B -S` would produce a bundle. If the user 
can't control some of the compiler options, such an approach would make the 
output format unpredictable. E.g. passing `--offload-arch=foo` to the compiler 
on godbolt would all of a sudden produce bundled output instead of assembly 
text or a sensible error message saying that you're trying to produce multiple 
outputs.

- Keep the current behavior (insist on single sub-compilation) as the default, 
allow overriding it for HIP with the flag. IMO that's the most consistent 
option and I still think it's the one most suitable to keep as the default.
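
The unpredictability described in the second option can be sketched as 
(hypothetical commands; arch names are only for illustration):

  # One sub-compilation: plain assembly, as in regular C++ compilation.
  clang -x hip --offload-arch=gfx900 --cuda-device-only -S a.hip

  # Same command shape with two archs: the output silently becomes a bundle.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -S a.hip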

I can see the benefit of always bundling for HIP, but I also believe that 
keeping things simple, consistent and predictable is important. Considering 
that we're tinkering in a relatively obscure niche of the compiler, it probably 
does not matter all that much, but it should not stop us from trying to figure 
out the best approach in a principled way.

I think we could benefit from a second opinion on which approach would make 
more sense for clang. 
Summoning @jdoerfert and @echristo.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-28 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#202 , @tra wrote:

> In D101630#2777346 , @yaxunl wrote:
>
>> In D101630#2748513 , @tra wrote:
>>
>>> How about this:
>>> If the user explicitly specified `--cuda-host-only` or 
>>> `--cuda-device-only`, then by default only allow producing the natural 
>>> output format, unless a bundled output is requested by an option. This 
>>> should keep existing users working.
>>> If the compilation is done without explicitly requested sub-compilation(s), 
>>> then bundle the output by default. This should keep the GPU-unaware tools 
>>> like ccache happy as they would always get the single output they expect.
>>>
>>> WDYT?
>>
>> `--cuda-host-only` always has one output, therefore there is no point in 
>> bundling its output. We only need to decide the proper behavior of 
>> `--cuda-device-only`.
>
> It still fits my proposal of requiring a single sub-compilation and not 
> bundling the output.
> The point was that such behavior is consistent regardless of whether we're 
> compiling CUDA or HIP for the host or for device.
>
>> How about keeping the original default behavior of not bundling when users 
>> do not specify an output file, and bundling the output when users do specify 
>> one?
>
> I think it will make things worse. Compiler output should not change 
> depending on whether `-o` is used.
>
>> Specifying an output file indicates that the user requests a single output. 
>> -f[no-]hip-bundle-device-output would override the default behavior.
>
> I disagree. When user specifies the output, the intent is to specify the 
> **location** of the outputs, not its contents or format.
>
> Telling compiler to produce a different output format should not depend on 
> specifying (or not) the output location.
>
> I think our options are:
>
> - Always bundle --cuda-device-only outputs by default. This is consistent for 
> HIP compilation, but deviates from CUDA, which can't do bundling. Also, 
> single-target subcompilation deviates from both CUDA and regular C++ 
> compilation, which is what most users would be familiar with and which would 
> probably be the most sensible default for a single sub-compilation. It can be 
> overridden with an option, but it goes against the principle that it's the 
> specialized use case that should need extra options. The most common use case 
> should not need them.
>
> - Only bundle multiple sub-compilations' output by default. This would 
> preserve the sensible single sub-compilation behavior. The downside is that 
> it makes the output format depend on whether compiler ends up doing one or 
> many sub-compilations. E.g. `--offload-arch=A -S` would produce ASM and 
> `--offload-arch=A --offload-arch=B -S` would produce a bundle. If the user 
> can't control some of the compiler options, such an approach would make the 
> output format unpredictable. E.g. passing `--offload-arch=foo` to the 
> compiler on godbolt would all of a sudden produce bundled output instead of 
> assembly text or a sensible error message saying that you're trying to 
> produce multiple outputs.
>
> - Keep the current behavior (insist on single sub-compilation) as the 
> default, allow overriding it for HIP with the flag. IMO that's the most 
> consistent option and I still think it's the one most suitable to keep as the 
> default.
>
> I can see the benefit of always bundling for HIP, but I also believe that 
> keeping things simple, consistent and predictable is important. Considering 
> that we're tinkering in a relatively obscure niche of the compiler, it 
> probably does not matter all that much, but it should not stop us from trying 
> to figure out the best approach in a principled way.
>
> I think we could benefit from a second opinion on which approach would make 
> more sense for clang. 
> Summoning @jdoerfert and @echristo.

How does nvcc --genco behave when there are multiple GPU archs? Does it output 
a fat binary containing multiple ISAs? Also, does it support device-only 
compilation for intermediate outputs?




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D101630#2787714 , @yaxunl wrote:

> How does nvcc --genco behave when there are multiple GPU arch's? Does it 
> output a fat binary containing multiple ISA's? Also, does it support 
> device-only compilation for intermediate outputs?

It does not allow multiple outputs for `-ptx` and `-cubin` compilations, same 
as clang behaves now:

  $ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 \
      -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
  nvcc fatal   : Option '--ptx (-ptx)' is not allowed when compiling for
  multiple GPU architectures

NVCC does allow `-E` with multiple targets, but it does produce output for only 
*one* of them.

NVCC does bundle outputs for multiple GPU variants if `-fatbin` is used.
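
For comparison, HIP's bundling step is performed by clang-offload-bundler. A 
device-only bundle of two GPU objects would be created roughly like this 
(sketch; paths and arch names are only for illustration):

  clang-offload-bundler -type=o \
    -targets=hip-amdgcn-amd-amdhsa-gfx803,hip-amdgcn-amd-amdhsa-gfx900 \
    -inputs=a-gfx803.o,a-gfx900.o \
    -outputs=a.o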




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-01 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2791734 , @tra wrote:

> In D101630#2787714 , @yaxunl wrote:
>
>> How does nvcc --genco behave when there are multiple GPU arch's? Does it 
>> output a fat binary containing multiple ISA's? Also, does it support 
>> device-only compilation for intermediate outputs?
>
> It does not allow multiple outputs for `-ptx` and `-cubin` compilations, same 
> as clang behaves now:
>
>   $ ~/local/cuda-11.3/bin/nvcc -gencode=arch=compute_60,code=sm_60 \
>       -gencode=arch=compute_70,code=sm_70 -ptx foo.cu
>   nvcc fatal   : Option '--ptx (-ptx)' is not allowed when compiling for
>   multiple GPU architectures
>
> NVCC does allow `-E` with multiple targets, but it does produce output for 
> only *one* of them.
>
> NVCC does bundle outputs for multiple GPU variants if `-fatbin` is used.

I think for intermediate outputs, e.g. preprocessor expansion, IR, and 
assembly, it probably makes sense not to bundle by default. However, for the 
default action (emitting object files), we need to bundle by default, since 
that was the old behavior and existing HIP apps depend on it. Then we allow 
-fhip-bundle-device-output to override the default behavior.
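
Under that proposal the defaults would look roughly like this (sketch; flag 
behavior as proposed here, arch names only for illustration):

  # Intermediate outputs: not bundled by default, one file per arch.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -S a.hip

  # Default action (object): bundled by default, matching the old behavior.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -c a.hip

  # Explicit override for intermediate outputs.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -S -fhip-bundle-device-output a.hip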




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-01 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D101630#2792052 , @yaxunl wrote:

> I think for intermediate outputs e.g. preprocessor expansion, IR, and 
> assembly, probably it makes sense not to bundle by default.

Agreed.

> However, for default action (emitting object), we need to bundle by default 
> since it was the old behavior and existing HIP apps depend on that.

Existing use is a valid point.
As a counterargument, I would suggest that in a compilation pipeline which does 
include bundling, an object file for one GPU variant *is* an intermediate 
output, similar to the ones you've listed above.

The final product of device-side subcompilations is a bundle. The question is 
`what does "-c" mean?`.  Is it `produce an object file` or `compile till the 
end of the pipeline` ? 
For CUDA and HIP compilation it's ambiguous. When we target just one GPU, it 
would be closer to the former. In general, it would be closer to the latter. 
NVCC side-steps the issue by using different flags, `-cubin`/`-fatbin`, to 
disambiguate between the two cases and avoid bolting CUDA-related semantics 
onto compiler flags that were not designed for that.

> Then we allow -fhip-bundle-device-output to override the default behavior.

OK. Bundling objects for HIP by default looks like a reasonable compromise. 
It would be worth generalizing the flag to `-fgpu-bundle...`, as that would 
help if/when we want to produce a fatbin during CUDA compilation. I'd still 
keep no-bundling as the default for CUDA's objects.

Now that we are in agreement of what we want, the next question is *how* we 
want to do it.

It appears that there's a fair bit of similarity between what the proposed 
`-fgpu-bundle` flag does and the handful of `--emit-...` options clang has now.
If we were to use something like `--emit-gpu-object` and `--emit-gpu-bundle`, 
it would be similar to NVCC's `-cubin`/`-fatbin`, and would decouple the 
default behavior of `-c --cuda-device-only` from the user's ability to specify 
what they want, without burdening `-c` with additional flags that would have 
different defaults under different circumstances.

Compilation with "-c" would remain "compile till the end", whatever that 
happens to mean for a particular language, and `--emit-object/bundle` would 
tell the compiler how far we want it to proceed and what kind of output we 
want.
This would probably be easier to explain to the users as they are already 
familiar with flags like `-emit-llvm`, only now we are dealing with an extra 
bundling step in the compilation pipeline. It would also behave consistently 
across CUDA and HIP even though they have different defaults for bundling for 
the device-side compilation. E.g. `-c --cuda-device-only --emit-gpu-bundle` 
will always produce a bundle with the object files for both CUDA and HIP and 
`-c --cuda-device-only --emit-gpu-object` will always require single '-o' 
output.
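
A sketch of the proposed spelling (these options are a proposal in this 
discussion, not existing clang flags; arch names are only for illustration):

  # Always produce a bundle of device objects, for both CUDA and HIP.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -c --emit-gpu-bundle a.hip -o a.o

  # Always require a single sub-compilation and produce a plain object.
  clang -x hip --offload-arch=gfx900 \
    --cuda-device-only -c --emit-gpu-object a.hip -o a.o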

WDYT? Does it make sense?




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-04 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2792160 , @tra wrote:

> In D101630#2792052 , @yaxunl wrote:
>
>> I think for intermediate outputs e.g. preprocessor expansion, IR, and 
>> assembly, probably it makes sense not to bundle by default.
>
> Agreed.
>
>> However, for default action (emitting object), we need to bundle by default 
>> since it was the old behavior and existing HIP apps depend on that.
>
> Existing use is a valid point.
> As a counterargument, I would suggest that in a compilation pipeline which 
> does include bundling, an object file for one GPU variant *is* an 
> intermediate output, similar to the ones you've listed above.
>
> The final product of device-side subcompilations is a bundle. The question is 
> `what does "-c" mean?`.  Is it `produce an object file` or `compile till the 
> end of the pipeline` ? 
> For CUDA and HIP compilation it's ambiguous. When we target just one GPU, it 
> would be closer to the former. In general, it would be closer to the latter. 
> NVCC side-steps the issue by using different flags, `-cubin`/`-fatbin`, to 
> disambiguate between the two cases and avoid bolting CUDA-related semantics 
> onto compiler flags that were not designed for that.
>
>> Then we allow -fhip-bundle-device-output to override the default behavior.
>
> OK. Bundling objects for HIP by default looks like a reasonable compromise. 
> It would be useful to generalize the flag to `-fgpu-bundle...` as it would be 
> useful if/when we want to produce a fatbin during CUDA compilation. I'd still 
> keep no-bundling as the default for CUDA's objects.
>
> Now that we are in agreement of what we want, the next question is *how* we 
> want to do it.
>
> It appears that there's a fair bit of similarity between what the proposed 
> `-fgpu-bundle` flag does and the handful of `--emit-...` options clang has 
> now.
> If we were to use something like `--emit-gpu-object` and `--emit-gpu-bundle`, 
> it would be similar to NVCC's `-cubin`/`-fatbin`, and would decouple the 
> default behavior of `-c --cuda-device-only` from the user's ability to 
> specify what they want, without burdening `-c` with additional flags that 
> would have different defaults under different circumstances.
>
> Compilation with "-c" would remain the "compile till the end", whatever it 
> happens to mean for particular language and `--emit-object/bundle` would tell 
> the compiler how far we want it to proceed and what kind of output we want. 
> This would probably be easier to explain to the users as they are already 
> familiar with flags like `-emit-llvm`, only now we are dealing with an extra 
> bundling step in the compilation pipeline. It would also behave consistently 
> across CUDA and HIP even though they have different defaults for bundling for 
> the device-side compilation. E.g. `-c --cuda-device-only --emit-gpu-bundle` 
> will always produce a bundle with the object files for both CUDA and HIP and 
> `-c --cuda-device-only --emit-gpu-object` will always require single '-o' 
> output.
>
> WDYT? Does it make sense?

For sure we will need -fgpu-bundle-device-output to control bundling of 
intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be 
redundant and can cause confusion. What if users specify `-c 
-fgpu-bundle-device-output -emit-gpu-object` or `-c 
-fno-gpu-bundle-device-output -emit-gpu-bundle`? To me a single option 
-fgpu-bundle-device-output to control all device output seems cleaner.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-04 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D101630#2798975 , @yaxunl wrote:

> For sure we will need -fgpu-bundle-device-output to control bundling of 
> intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be 
> redundant and can cause confusion. What if users specify `-c 
> -fgpu-bundle-device-output -emit-gpu-object` or `-c 
> -fno-gpu-bundle-device-output -emit-gpu-bundle`? To me a single option 
> -fgpu-bundle-device-output to control all device output seems cleaner.

The idea is to use `-emit-gpu-object` and `-emit-gpu-bundle` instead of the  
`-f[no-]gpu-bundle-device-output`. Otherwise they'd do exactly the same thing.

I think `-emit-gpu-{object,bundle}` has a minor edge over 
`-f[no-]gpu-bundle-device-output` as it's similar to other -emit options for 
controlling clang compilation phases (and that's what we want to do here), 
while `-f` options are usually for tweaking code generation.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-04 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2799409 , @tra wrote:

> In D101630#2798975 , @yaxunl wrote:
>
>> For sure we will need -fgpu-bundle-device-output to control bundling of 
>> intermediate files. Then adding -emit-gpu-object and -emit-gpu-bundle may be 
>> redundant and can cause confusion. What if users specify `-c 
>> -fgpu-bundle-device-output -emit-gpu-object` or `-c 
>> -fno-gpu-bundle-device-output -emit-gpu-bundle`? To me a single option 
>> -fgpu-bundle-device-output to control all device output seems cleaner.
>
> The idea is to use `-emit-gpu-object` and `-emit-gpu-bundle` instead of the  
> `-f[no-]gpu-bundle-device-output`. Otherwise they'd do exactly the same thing.
>
> I think `-emit-gpu-{object,bundle}` has a minor edge over 
> `-f[no-]gpu-bundle-device-output` as it's similar to other -emit options for 
> controlling clang compilation phases (and that's what we want to do here), 
> while `-f` options are usually for tweaking code generation.

But how do we control emitting LLVM IR with or without bundle? `-emit-llvm 
-emit-gpu-object` or `-emit-llvm -emit-gpu-bundle`? `-emit-*` is usually for 
specifying a specific file type.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-04 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D101630#2799425 , @yaxunl wrote:

> But how do we control emitting LLVM IR with or without bundle? `-emit-llvm 
> -emit-gpu-object` or `-emit-llvm -emit-gpu-bundle`? `-emit-*` is usually for 
> specifying a specific file type.

Hmm. I forgot that HIP can bundle things other than objects. `-emit-llvm 
-emit-gpu-bundle` looks reasonable, but `-emit-llvm -emit-gpu-object` is indeed 
odd.
OK. Making it some sort of do/do-not bundle flag makes sense. How about just 
`--[no-]gpu-bundle-output`?




[PATCH] D101630: [HIP] Fix device-only compilation

2021-06-04 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2799472 , @tra wrote:

> In D101630#2799425 , @yaxunl wrote:
>
>> But how do we control emitting LLVM IR with or without bundle? `-emit-llvm 
>> -emit-gpu-object` or `-emit-llvm -emit-gpu-bundle`? `-emit-*` is usually for 
>> specifying a specific file type.
>
> Hmm. I forgot that HIP can bundle things other than objects. `-emit-llvm 
> -emit-gpu-bundle` looks reasonable, but `-emit-llvm -emit-gpu-object` is 
> indeed odd.
> OK. Making it some sort of do/do-not bundle flag makes sense. How about just 
> `--[no-]gpu-bundle-output`?

`--[no-]gpu-bundle-output` sounds good.
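
A sketch of how the agreed-upon flag would be used (arch and file names are 
only for illustration):

  # Bundle the device-only outputs into the single file given by -o.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -c a.hip -o a.o --gpu-bundle-output

  # Keep one output file per GPU arch.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -S a.hip --no-gpu-bundle-output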




[PATCH] D101630: [HIP] Fix device-only compilation

2021-04-30 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl created this revision.
yaxunl added a reviewer: tra.
yaxunl requested review of this revision.

When clang compiles a HIP program with -E, there are multiple output
files for the host and the different GPU archs. Clang uses clang-offload-bundler
to bundle them into one output file.

Currently clang does this for combined host/device compilation but does not do
it for device-only compilation. This causes issues when there are multiple
GPU archs in a device-only compilation: if -o is specified, the outputs for
the different GPU archs will overwrite each other.

This patch fixes that by bundling the output files if there are multiple
GPU archs. This is consistent with the behavior for combined host/device 
compilation.
Clang will automatically unbundle the output if the preprocessor expansion
output is used as input.

The same applies to -emit-llvm and -S.
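
A sketch of the problem being fixed (hypothetical command; arch names are only 
for illustration):

  # Two device sub-compilations but a single output path: without bundling,
  # the gfx900 preprocessor output overwrites the gfx803 one.
  clang -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
    --cuda-device-only -E a.hip -o a.i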


https://reviews.llvm.org/D101630

Files:
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/clang-offload-bundler.c
  clang/test/Driver/hip-output-file-name.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip
  clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp

Index: clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
===
--- clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
+++ clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
@@ -117,6 +117,9 @@
 /// The index of the host input in the list of inputs.
 static unsigned HostInputIndex = ~0u;
 
+/// Whether not having host target is allowed.
+static bool AllowNoHost = false;
+
 /// Path to the current binary.
 static std::string BundlerExecutable;
 
@@ -857,9 +860,10 @@
   }
 
   // Get the file handler. We use the host buffer as reference.
-  assert(HostInputIndex != ~0u && "Host input index undefined??");
+  assert((HostInputIndex != ~0u || AllowNoHost) &&
+ "Host input index undefined??");
  Expected<std::unique_ptr<FileHandler>> FileHandlerOrErr =
-  CreateFileHandler(*InputBuffers[HostInputIndex]);
+  CreateFileHandler(*InputBuffers[AllowNoHost ? 0 : HostInputIndex]);
   if (!FileHandlerOrErr)
 return FileHandlerOrErr.takeError();
 
@@ -1126,6 +1130,7 @@
   // have exactly one host target.
   unsigned Index = 0u;
   unsigned HostTargetNum = 0u;
+  bool HIPOnly = true;
  llvm::DenseSet<StringRef> ParsedTargets;
   for (StringRef Target : TargetNames) {
 if (ParsedTargets.contains(Target)) {
@@ -1167,12 +1172,21 @@
   HostInputIndex = Index;
 }
 
+if (Kind != "hip" && Kind != "hipv4")
+  HIPOnly = false;
+
 ++Index;
   }
 
+  // HIP uses clang-offload-bundler to bundle device-only compilation results
+  // for multiple GPU archs, therefore allow no host target if all entries
+  // are for HIP.
+  AllowNoHost = HIPOnly;
+
   // Host triple is not really needed for unbundling operation, so do not
   // treat missing host triple as error if we do unbundling.
-  if ((Unbundle && HostTargetNum > 1) || (!Unbundle && HostTargetNum != 1)) {
+  if ((Unbundle && HostTargetNum > 1) ||
+  (!Unbundle && HostTargetNum != 1 && !AllowNoHost)) {
 reportError(createStringError(errc::invalid_argument,
   "expecting exactly one host target but got " +
   Twine(HostTargetNum)));
Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -56,8 +56,8 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx803"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"a.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"a.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*a.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-o" {{".*a.*ll"}} "-x" "hip"
 // CHECK-SAME: {{.*}} {{".*a.cu"}}
 
 // COMMON: [[CLANG]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
@@ -69,10 +69,14 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx900"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"a.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"a.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*a.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-o" {{".*a.*ll"}} "-x" "hip"
 // COMMON-SAME: {{.*}} {{".*a.cu"}}
 
+// COMMON: "{{.*}}clang-offload-bundler" "-type={{(bc|ll)}}"
+// COMMON-SAME: "-targets=hip-amdgcn-amd-amdhsa-gfx803,hip-amdgcn-amd-amdhsa-gfx900"
+// COMMON-SAME: "-outputs=a-hip-amdgcn-amd-amdhsa.{{(bc|ll)}}"
+
 // COMMON: [[CLANG]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
 // COMMON-SAME: "-aux-triple" "x86_64-unknown-linux-gnu"
 // EMITBC-SAME: "-emit-llvm-bc"
@@ -82,8 +86,8 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx803"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"b.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"b.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} 

[PATCH] D101630: [HIP] Fix device-only compilation

2021-04-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

CUDA compilation currently errors out if `-o` is used when more than one output 
would be produced.
E.g.

  % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
--cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E 
  #... preprocessed output from host and 2 GPU compilations is printed out
  
  % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
--cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E  -o foo.out
  clang-13: error: cannot specify -o when generating multiple output files
  
  % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
--cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -E  -o foo.out
  clang-13: error: cannot specify -o when generating multiple output files
  
  % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
--cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -S  -o foo.out
  clang-13: error: cannot specify -o when generating multiple output files

I think I've borrowed that behavior from some of the macos-related 
functionality, so we do have a somewhat established model of how to handle 
multiple outputs.
Wrapping multiple outputs into a single bundle could be an option too.

The question is -- what would make most sense.
Are bundles useful in cases when the user would use options that give us 
intermediate compiler outputs?

In my experience, most such use cases are intended for manual examination of 
compiler output, and as such I'd prefer to keep the results immediately usable, 
without having to unbundle them. In such cases we're already changing 
command-line options, and adjusting them to produce the output from the 
specific sub-compilation I want is trivial. Having to unbundle things is more 
complicated, as the bundler/unbundler tool as it stands is poorly documented 
and not particularly user-friendly. If it is to become a user-facing tool like 
ar/nm/objdump, it would need some improvements.
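For reference, manually unbundling a device-only bundle with the current tool 
looks roughly like this (a sketch only; the target strings and file names are 
taken from the tests in this review and may differ in practice):

```shell
# Unbundle a two-arch HIP device bitcode bundle into per-arch files.
# The -inputs/-outputs spelling is the one used elsewhere in this review;
# adjust -type (bc/ll/s/o) to match the bundled content.
clang-offload-bundler -type=bc \
  -targets=hip-amdgcn-amd-amdhsa-gfx803,hip-amdgcn-amd-amdhsa-gfx900 \
  -inputs=a-hip-amdgcn-amd-amdhsa.bc \
  -outputs=a-gfx803.bc,a-gfx900.bc \
  -unbundle
```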

If you do have use cases where you do need to bundle intermediate results, are 
they for human consumption or for tooling? Perhaps we should make the 
"bundle the outputs" behavior controllable by a flag, and keep enforcing 
"one output only" as the default.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101630/new/

https://reviews.llvm.org/D101630

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D101630: [HIP] Fix device-only compilation

2021-04-30 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2729573 , @tra wrote:

> CUDA compilation currently errors out if `-o` is used when more than one 
> output would be produced.
> E.g.
>
>   % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
> --cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E 
>   #... preprocessed output from host and 2 GPU compilations is printed out
>   
>   % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
> --cuda-path=$HOME/local/cuda-10.2  zz.cu -c -E  -o foo.out
>   clang-13: error: cannot specify -o when generating multiple output files
>   
>   % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
> --cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -E  -o foo.out
>   clang-13: error: cannot specify -o when generating multiple output files
>   
>   % bin/clang++ -x cuda --offload-arch=sm_60 --offload-arch=sm_70 
> --cuda-path=$HOME/local/cuda-10.2  zz.cu -c --cuda-device-only -S  -o foo.out
>   clang-13: error: cannot specify -o when generating multiple output files
>
> I think I've borrowed that behavior from some of the macos-related 
> functionality, so we do have a somewhat established model of how to handle 
> multiple outputs.
> Wrapping multiple outputs into a single bundle could be an option too.
>
> The question is -- what would make most sense.
> Are bundles useful in cases when the user would use options that give us 
> intermediate compiler outputs?
>
> In my experience, most such use cases are intended for manual examination 
> of compiler output, and as such I'd prefer to keep the results immediately 
> usable, without having to unbundle them. In such cases we're already changing 
> command-line options, and adjusting them to produce the output from the 
> specific sub-compilation I want is trivial. Having to unbundle things is more 
> complicated, as the bundler/unbundler tool as it stands is poorly 
> documented and not particularly user-friendly. If it is to become a 
> user-facing tool like ar/nm/objdump, it would need some improvements.
>
> If you do have use cases where you do need to bundle intermediate results, are 
> they for human consumption or for tooling? Perhaps we should make the 
> "bundle the outputs" behavior controllable by a flag, and keep enforcing 
> "one output only" as the default.

We use ccache and need one output for -E with device compilation. Also there 
are use cases to emit bitcode for device compilation and link them later. These 
use cases require output to be bundled.

If users want to get the unbundled output, they can use -save-temps. Is it 
sufficient?
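For completeness, the -save-temps route would look something like this (a 
sketch; the exact intermediate file names depend on the arch list):

```shell
# Keep the per-arch intermediates (preprocessed source, bitcode, asm) on
# disk alongside the final output, instead of unbundling afterwards.
clang++ -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
        --cuda-device-only -c -save-temps a.hip
```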




[PATCH] D101630: [HIP] Fix device-only compilation

2021-04-30 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

What will happen with this patch in the following scenarios:

- `--offload_arch=A -S -o out.s`
- `--offload_arch=A --offload-arch=B -S -o out.s`

I would expect the first case to produce a plain text assembly file. With this 
patch the second case will produce a bundle. With some build tools users only 
add to the various compiler options provided by the system. Depending on 
whether those system-provided options include an `--offload-arch`, the format 
of the output in the first example becomes unstable. So the consistent way 
would be to always bundle everything, but that breaks (or at least complicates) 
the normal single-output case and makes it deviate from what users expect from 
a regular C++ compilation.

In D101630#2729768 , @yaxunl wrote:

> We use ccache and need one output for -E with device compilation. Also there 
> are use cases to emit bitcode for device compilation and link them later. 
> These use cases require output to be bundled.

This is a good point. I don't think I've ever used ccache on a CUDA 
compilation, but I see how ccache may get surprised.

Considering the scenario above, I think a better way to handle it would be to 
teach ccache about CUDA/HIP compilation. It's a similar situation with support 
for split DWARF, when compiler does something beyond the expected one-input to 
one-output transformation.
E.g. we could tell it to use stdout for `-E`. Or implement the 
`bundle-everything` flag in clang and let ccache use it if it needs to have a 
single output.

> If users want to get the unbundled output, they can use -save-temps. Is it 
> sufficient?

In terms of saving intermediate outputs - yes. In terms of usability - no. 
Sometimes I want one particular intermediate result saved with exact filename 
(or piped to stdout) and saving bunch and then picking one would be a pretty 
annoying usability regression for me.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-04-30 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2729975 , @tra wrote:

> What will happen with this patch in the following scenarios:
>
> - `--offload_arch=A -S -o out.s`
> - `--offload_arch=A --offload-arch=B -S -o out.s`
>
> I would expect the first case to produce a plain text assembly file. With 
> this patch the second case will produce a bundle. With some build tools users 
> only add to the various compiler options provided by the system. Depending on 
> whether those system-provided options include an `--offload-arch`, the format 
> of the output in the first example becomes unstable. So the consistent way 
> would be to always bundle everything, but that breaks (or at least 
> complicates) the normal single-output case and makes it deviate from what 
> users expect from a regular C++ compilation.
>
> In D101630#2729768 , @yaxunl wrote:
>
>> We use ccache and need one output for -E with device compilation. Also there 
>> are use cases to emit bitcode for device compilation and link them later. 
>> These use cases require output to be bundled.
>
> This is a good point. I don't think I've ever used ccache on a CUDA 
> compilation, but I see how ccache may get surprised.
>
> Considering the scenario above, I think a better way to handle it would be to 
> teach ccache about CUDA/HIP compilation. It's a similar situation with 
> support for split DWARF, when compiler does something beyond the expected 
> one-input to one-output transformation.
> E.g. we could tell it to use stdout for `-E`. Or implement the 
> `bundle-everything` flag in clang and let ccache use it if it needs to have a 
> single output.
>
>> If users want to get the unbundled output, they can use -save-temps. Is it 
>> sufficient?
>
> In terms of saving intermediate outputs - yes. In terms of usability - no. 
> Sometimes I want one particular intermediate result saved with exact filename 
> (or piped to stdout) and saving bunch and then picking one would be a pretty 
> annoying usability regression for me.

How about an option -fhip-bundle-device-output? If it is on, device output is 
bundled no matter how many GPU archs there are. By default it is on.
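To make the proposal concrete, the intended usage would be roughly as follows 
(option spelling as proposed in this thread, not yet committed):

```shell
# Default: device outputs for all GPU archs are bundled into one file.
clang++ -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
        --cuda-device-only -c a.hip -o a.o
# Opt out: produce one (unbundled) output per GPU arch.
clang++ -x hip --offload-arch=gfx803 --offload-arch=gfx900 \
        --cuda-device-only -fno-hip-bundle-device-output -c a.hip
```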




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-01 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl updated this revision to Diff 342182.
yaxunl edited the summary of this revision.
yaxunl added a comment.
Herald added a subscriber: dang.

added option -fhip-bundle-device-output



Files:
  clang/include/clang/Driver/Options.td
  clang/lib/Driver/Driver.cpp
  clang/test/Driver/clang-offload-bundler.c
  clang/test/Driver/hip-device-compile.hip
  clang/test/Driver/hip-output-file-name.hip
  clang/test/Driver/hip-phases.hip
  clang/test/Driver/hip-rdc-device-only.hip
  clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp

Index: clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
===
--- clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
+++ clang/tools/clang-offload-bundler/ClangOffloadBundler.cpp
@@ -117,6 +117,9 @@
 /// The index of the host input in the list of inputs.
 static unsigned HostInputIndex = ~0u;
 
+/// Whether not having host target is allowed.
+static bool AllowNoHost = false;
+
 /// Path to the current binary.
 static std::string BundlerExecutable;
 
@@ -857,9 +860,10 @@
   }
 
   // Get the file handler. We use the host buffer as reference.
-  assert(HostInputIndex != ~0u && "Host input index undefined??");
+  assert((HostInputIndex != ~0u || AllowNoHost) &&
+ "Host input index undefined??");
  Expected<std::unique_ptr<FileHandler>> FileHandlerOrErr =
-  CreateFileHandler(*InputBuffers[HostInputIndex]);
+  CreateFileHandler(*InputBuffers[AllowNoHost ? 0 : HostInputIndex]);
   if (!FileHandlerOrErr)
 return FileHandlerOrErr.takeError();
 
@@ -1126,6 +1130,7 @@
   // have exactly one host target.
   unsigned Index = 0u;
   unsigned HostTargetNum = 0u;
+  bool HIPOnly = true;
  llvm::DenseSet<StringRef> ParsedTargets;
   for (StringRef Target : TargetNames) {
 if (ParsedTargets.contains(Target)) {
@@ -1167,12 +1172,21 @@
   HostInputIndex = Index;
 }
 
+if (Kind != "hip" && Kind != "hipv4")
+  HIPOnly = false;
+
 ++Index;
   }
 
+  // HIP uses clang-offload-bundler to bundle device-only compilation results
+  // for multiple GPU archs, therefore allow no host target if all entries
+  // are for HIP.
+  AllowNoHost = HIPOnly;
+
   // Host triple is not really needed for unbundling operation, so do not
   // treat missing host triple as error if we do unbundling.
-  if ((Unbundle && HostTargetNum > 1) || (!Unbundle && HostTargetNum != 1)) {
+  if ((Unbundle && HostTargetNum > 1) ||
+  (!Unbundle && HostTargetNum != 1 && !AllowNoHost)) {
 reportError(createStringError(errc::invalid_argument,
   "expecting exactly one host target but got " +
   Twine(HostTargetNum)));
Index: clang/test/Driver/hip-rdc-device-only.hip
===
--- clang/test/Driver/hip-rdc-device-only.hip
+++ clang/test/Driver/hip-rdc-device-only.hip
@@ -56,8 +56,8 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx803"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"a.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"a.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*a.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-o" {{".*a.*ll"}} "-x" "hip"
 // CHECK-SAME: {{.*}} {{".*a.cu"}}
 
 // COMMON: [[CLANG]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
@@ -69,10 +69,14 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx900"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"a.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"a.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*a.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-o" {{".*a.*ll"}} "-x" "hip"
 // COMMON-SAME: {{.*}} {{".*a.cu"}}
 
+// COMMON: "{{.*}}clang-offload-bundler" "-type={{(bc|ll)}}"
+// COMMON-SAME: "-targets=hip-amdgcn-amd-amdhsa-gfx803,hip-amdgcn-amd-amdhsa-gfx900"
+// COMMON-SAME: "-outputs=a-hip-amdgcn-amd-amdhsa.{{(bc|ll)}}"
+
 // COMMON: [[CLANG]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
 // COMMON-SAME: "-aux-triple" "x86_64-unknown-linux-gnu"
 // EMITBC-SAME: "-emit-llvm-bc"
@@ -82,8 +86,8 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx803"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"b.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"b.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*b.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-o" {{".*b.*ll"}} "-x" "hip"
 // COMMON-SAME: {{.*}} {{".*b.hip"}}
 
 // COMMON: [[CLANG]] "-cc1" "-triple" "amdgcn-amd-amdhsa"
@@ -95,10 +99,14 @@
 // COMMON-SAME: "-fapply-global-visibility-to-externs"
 // COMMON-SAME: "-target-cpu" "gfx900"
 // COMMON-SAME: "-fgpu-rdc"
-// EMITBC-SAME: {{.*}} "-o" {{"b.*bc"}} "-x" "hip"
-// EMITLL-SAME: {{.*}} "-o" {{"b.*ll"}} "-x" "hip"
+// EMITBC-SAME: {{.*}} "-o" {{".*b.*bc"}} "-x" "hip"
+// EMITLL-SAME: {{.*}} "-

[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-03 Thread Artem Belevich via Phabricator via cfe-commits
tra added a subscriber: jdoerfert.
tra added a comment.

In D101630#2730273 , @yaxunl wrote:

> How about an option -fhip-bundle-device-output? If it is on, device output is 
> bundled no matter how many GPU archs there are. By default it is on.

+1 to the option, but I can't say I'm particularly happy about the default. I'd 
still prefer the default to be a no-bundling + an error in cases when we'd 
nominally produce multiple outputs.
We could use a third opinion here.

@jdoerfert : Do you have any thoughts on what would be a sensible default when 
a user uses `-S -o foo.s` for compilations that may produce multiple results? I 
think OpenMP may have to deal with similar issues.

On one hand it would be convenient for ccache to just work with the CUDA/HIP 
compilation out of the box. Compiler always produces one output file, 
regardless of what it does under the hood and ccache may not care what's in it.

On the other, this behavior breaks user expectations. I.e. `clang -S` is 
supposed to produce the assembly, not an opaque binary bundle blob.
Using `-S` with multiple sub-compilations is also likely an error on the 
user's end and should be explicitly diagnosed, and that's how it currently works.
Using `-fno-hip-bundle-device-output` to restore the expected behavior puts the 
burden on the wrong side, IMO.  I believe, it should be ccache which should be 
using `-fhip-bundle-device-output` to deal with the CUDA/HIP compilations.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-04 Thread Jan Svoboda via Phabricator via cfe-commits
jansvoboda11 added inline comments.



Comment at: clang/include/clang/Driver/Options.td:977
   NegFlag<SetFalse>>;
+defm hip_bundle_device_output : BoolFOption<"hip-bundle-device-output", EmptyKPM, DefaultTrue,
+  PosFlag<SetTrue, [], "Bundle output files of HIP device compilation">,

The TableGen marshalling infrastructure (`BoolFOption` et al.) is only 
intended for flags that map to the `-cc1` frontend and its `CompilerInvocation` 
class. (`EmptyKPM`, which disables this mapping, shouldn't even exist anymore.)

Since these flags only work on the driver level, use something like this 
instead:

```
def fhip_bundle_device_output : Flag<["-"], "fhip-bundle-device-output">,
  Group<f_Group>;
def fno_hip_bundle_device_output : Flag<["-"], "fno-hip-bundle-device-output">,
  Group<f_Group>, HelpText<"Do not bundle output files of HIP device compilation">;
```




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-04 Thread Artem Belevich via Phabricator via cfe-commits
tra added inline comments.



Comment at: clang/include/clang/Driver/Options.td:977
   NegFlag<SetFalse>>;
+defm hip_bundle_device_output : BoolFOption<"hip-bundle-device-output", EmptyKPM, DefaultTrue,
+  PosFlag<SetTrue, [], "Bundle output files of HIP device compilation">,

jansvoboda11 wrote:
> The TableGen marshalling infrastructure (`BoolFOption` et al.) is only 
> intended for flags that map to the `-cc1` frontend and its 
> `CompilerInvocation` class. (`EmptyKPM`, which disables this mapping, 
> shouldn't even exist anymore.)
> 
> Since these flags only work on the driver level, use something like this 
> instead:
> 
> ```
> def fhip_bundle_device_output : Flag<["-"], "fhip-bundle-device-output">,
>   Group<f_Group>;
> def fno_hip_bundle_device_output : Flag<["-"], "fno-hip-bundle-device-output">,
>   Group<f_Group>, HelpText<"Do not bundle output files of HIP device 
> compilation">;
> ```

It would be great if `BoolFOption` would at least document that assumption.
It would be even better if it could also be used to implement a driver-only 
option, maybe with a flag.
The class is very useful for handling -fsomething/-fno-something, and 
restricting it to cc1 only will continue to be a constant source of this kind 
of confusion. I think we've already seen a handful of these in reviews since 
the flag tablegen got overhauled.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-07 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment.

In D101630#2733761 , @tra wrote:

> In D101630#2730273 , @yaxunl wrote:
>
>> How about an option -fhip-bundle-device-output? If it is on, device output 
>> is bundled no matter how many GPU archs there are. By default it is on.
>
> +1 to the option, but I can't say I'm particularly happy about the default. 
> I'd still prefer the default to be a no-bundling + an error in cases when 
> we'd nominally produce multiple outputs.
> We could use a third opinion here.
>
> @jdoerfert : Do you have any thoughts on what would be a sensible default 
> when a user uses `-S -o foo.s` for compilations that may produce multiple 
> results? I think OpenMP may have to deal with similar issues.
>
> On one hand it would be convenient for ccache to just work with the CUDA/HIP 
> compilation out of the box. Compiler always produces one output file, 
> regardless of what it does under the hood and ccache may not care what's in 
> it.
>
> On the other, this behavior breaks user expectations. I.e. `clang -S` is 
> supposed to produce the assembly, not an opaque binary bundle blob.
> Using `-S` with multiple sub-compilations is also likely an error on the 
> user's end and should be explicitly diagnosed, and that's how it currently 
> works.
> Using `-fno-hip-bundle-device-output` to restore the expected behavior puts 
> the burden on the wrong side, IMO.  I believe, it should be ccache which 
> should be using `-fhip-bundle-device-output` to deal with the CUDA/HIP 
> compilations.

I chose to emit the bundled output by default since it is the convention for a 
compiler to have one output. The compilation is like a pipeline: if we break it 
into stages, users would expect to use the output of one stage as the input of 
the next. That is possible only if there is one output. It already holds for 
host compilations and for combined device/host compilations, so it would be 
surprising if it were not true for device compilation.

Also, when users do not want the output to be bundled, it is usually for 
debugging or special purposes, and they need to know the naming convention of 
the multiple outputs. I think it is justifiable to gate that behind an option.




[PATCH] D101630: [HIP] Fix device-only compilation

2021-05-10 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment.

In D101630#2744861 , @yaxunl wrote:

> [snip] it is the convention for a compiler to have one output. 
> The compilation is like a pipeline: if we break it into stages, users would 
> expect to use the output of one stage as the input of the next. This is 
> possible only if there is one output. 
> Also, when users do not want the output to be bundled, it is usually for 
> debugging or special purposes, and they need to know the naming convention of 
> the multiple outputs. I think it is justifiable to gate that behind an option.

So in the end it's a trade-off between tools like ccache working out of the box 
vs. an additional option that would need to be used by users who do need a 
specific intermediate output. 
Considering that intermediate outputs already need special handling, I'll agree 
that bundling by default is likely more useful.

Now the question is how to make it work without breaking existing users.

There are some tools that rely on clang `--cuda-host-only` and 
`--cuda-device-only` to work as if it were a regular C++ compilation, e.g. 
godbolt.org.
It may be useful to do bundling **only** if we're dealing with multiple 
outputs. 
On the other hand, it may puzzle users why they get a bundle with `clang -S 
--offload_arch=X --offload_arch=Y` but plain text assembly with `clang -S 
--offload_arch=X`.

How about this:
If the user explicitly specified `--cuda-host-only` or `--cuda-device-only`, 
then by default only allow producing the natural output format, unless a 
bundled output is requested by an option. This should keep existing users 
working.
If the compilation is done without explicitly requested sub-compilation(s), 
then bundle the output by default. This should keep the GPU-unaware tools like 
ccache happy as they would always get the single output they expect.

WDYT?
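If I read the proposal right, the behavior would be roughly as below (a sketch 
of the intended semantics, not of any committed implementation):

```shell
# No explicit sub-compilation requested: always bundle, so GPU-unaware
# tools such as ccache see exactly one output file.
clang++ -x hip --offload-arch=gfx803 --offload-arch=gfx900 -c a.hip -o a.o

# Explicit --cuda-device-only: keep the natural output format; with more
# than one arch plus -o this would remain an error unless bundled output
# is explicitly requested via an option.
clang++ -x hip --offload-arch=gfx803 --cuda-device-only -S a.hip -o a.s
```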

