[PATCH] D134824: Fix frint ACLE intrinsic names

2022-09-29 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 accepted this revision.
jaykang10 added a comment.
This revision is now accepted and ready to land.

Thanks for fixing it!
LGTM
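For reference, a minimal usage sketch of the renamed intrinsics, assuming the
ACLE-specified __rint32zf/__rint64x-style names that this patch aligns
arm_acle.h with (requires an Armv8.5-A target, where __ARM_FEATURE_FRINT is
defined):

  #include <arm_acle.h>

  #ifdef __ARM_FEATURE_FRINT
  /* Round to the nearest i32-representable value, toward zero (frint32z). */
  float round32z(float x) { return __rint32zf(x); }
  /* Round to the nearest i64-representable value, current mode (frint64x). */
  double round64x(double x) { return __rint64x(x); }
  #endif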


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D134824/new/

https://reviews.llvm.org/D134824

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D108470: [OpenCL] Fix as_type3 invalid store creation

2021-08-23 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In D108470#2959715 , @svenvh wrote:

> In D108470#2959708 , @jaykang10 
> wrote:
>
>> Can you also add "double2_to_float3"
>
> Instead of `double2_to_float3`, I decided to add `char8_to_short3`.  Same 
> idea (vectorN to vector3), but reinterpreting 8 chars as 3 shorts feels like 
> a more realistic case than reinterpreting 2 doubles as 3 floats.  But I'm 
> happy to add `double2_to_float3` if you think there is a good reason to do so.

I am sorry for asking for more tests; I did not add enough tests previously.
I do not know how many vec3 type conversions the recent OpenCL spec covers, but 
it would be good to add tests for the type conversions the spec mentions. If 
you feel the tests you have added are enough, I am OK with it.
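
For reference, a char8_to_short3 test in that spirit could look like the
following sketch (illustrative only, not the committed test):

  // Reinterpreting 8 chars (8 bytes) as 3 shorts is size-preserving, because
  // short3 also occupies 8 bytes (a vec3 is sized like the corresponding vec4).
  void kernel char8_to_short3(global short3 *a, global char8 *b) {
    *a = __builtin_astype(*b, short3);
  }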

>> and "float4_to_float3" tests please?
>
> That test is already there, see line 17.

Sorry, I missed it.

LGTM


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108470/new/

https://reviews.llvm.org/D108470

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D108470: [OpenCL] Fix as_type3 invalid store creation

2021-08-23 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

Can you also add "double2_to_float3" and "float4_to_float3" tests please?


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108470/new/

https://reviews.llvm.org/D108470

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D108470: [OpenCL] Fix as_type3 invalid store creation

2021-08-23 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

If possible, can you add more tests for different types, as in the previous 
patch, please?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108470/new/

https://reviews.llvm.org/D108470

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D107963: [OpenCL] Fix as_type(vec3) invalid store creation

2021-08-17 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added inline comments.



Comment at: clang/lib/CodeGen/CGExprScalar.cpp:4789
-
-if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
-  Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,

svenvh wrote:
> jaykang10 wrote:
> > svenvh wrote:
> > > Anastasia wrote:
> > > > While I agree with this fix and it obviously looks incorrect, I wonder 
> > > > if the original intent was to condition the previous statement instead 
> > > > so that we avoid converting to size 4 at all? Although I have a feeling 
> > > > we are entering the behavior that is not documented anywhere. In the 
> > > > spec I can see this:
> > > > 
> > > > 
> > > > ```
> > > > When the operand and result type contain a different number of 
> > > > elements, the result shall be implementation-defined except if the 
> > > > operand is a 4-component vector and the result is a 3-component vector. 
> > > > In this case, the bits in the operand shall be returned directly 
> > > > without modification as the new type. 
> > > > ```
> > > > 
> > > > but it seems to cover the inverse conversion?
> > > Yeah I have a similar fix for the inverse case (which is further down in 
> > > this function) in my local branch.
> > > 
> > > I did try to extend the guard to also cover the `ConvertVec3AndVec4` 
> > > call, but that also led to invalid StoreInst creation.  Since I wasn't 
> > > sure about the intent of the conditioning on `PreserveVec3Type` here, I 
> > > didn't investigate further.
> > > 
> > > I was hoping @jaykang10 (who added this in D30810) might have some 
> > > insight into why the guard was here in the first place.  But it has been 
> > > over 4 years since that was committed, so there might not be a ready 
> > > answer.  Either way, I'll hold off committing this for a few more days.
> > I am sorry for the late response. I have not been feeling well.
> > 
> > As far as I remember, the goal was to avoid the bitcast and keep loads and 
> > stores with the vec3 type at the IR level. I guess I did not consider the 
> > conversion from vec3 type to scalar type, or vice versa.
> > 
> > I guess this guard was there to avoid the bitcast. It could be wrong for 
> > scalar types. If you check for scalar types in the guard, it could be good 
> > to keep the existing behavior for vector types.
> > 
> > Additionally, you might also want to change the code below for the 
> > conversion from non-vec3 to vec3.
> No worries, thanks for replying!
> 
> > the goal was to avoid bitcast and keep load or store with vec3 type on IR 
> > level.
> 
> I think that is already achieved by the changes in CGExpr.cpp from your 
> previous commit.  But here in CGExprScalar.cpp we are handling the case where 
> we have to convert away to non-vec3 (because `NumElementsDst != 3`) and we do 
> this conversion unconditionally already.  I don't see why we would not want 
> to emit the bitcast because it is needed for correctness.
> 
> > It could be wrong for scalar type.
> 
> The problem that my patch fixes is not limited to scalar types: it also 
> occurs for e.g. `float3` to `double2`.  Perhaps I should add that test case 
> too?
> 
> > If you check the scalar type in the guard, it could be good to keep 
> > existing behavior for vector type.
> 
> My patch does not make a difference to any of the pre-existing tests in 
> `preserve_vec3.cl`.  Do you have a specific case that is not covered by the 
> test, but for which you want to preserve the behavior?
> I think that is already achieved by the changes in CGExpr.cpp from your 
> previous commit. But here in CGExprScalar.cpp we are handling the case where 
> we have to convert away to non-vec3 (because NumElementsDst != 3) and we do 
> this conversion unconditionally already. I don't see why we would not want to 
> emit the bitcast because it is needed for correctness.
I agree with you. I vaguely remember it was for a transformation pass in my 
previous project. For correctness, please feel free to remove the guard.

>The problem that my patch fixes is not limited to scalar types: it also occurs 
>for e.g. float3 to double2. Perhaps I should add that test case too?
Yes, it would be great if you added more test cases.

> My patch does not make a difference to any of the pre-existing tests in 
> preserve_vec3.cl. Do you have a specific case that is not covered by the 
> test, but for which you want to preserve the behavior?
I cannot remember exactly what my previous patch aimed for. If someone raises 
issues with removing this guard later, we can discuss it again.
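
For reference, the float3-to-double2 case mentioned above could be sketched as
follows (illustrative only; both types occupy 16 bytes):

  // Without the fix, reinterpreting a 16-byte float3 as a double2 produced an
  // invalid store through a mismatched vector type.
  void kernel float3_to_double2(global double2 *a, global float3 *b) {
    *a = __builtin_astype(*b, double2);
  }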


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107963/new/

https://reviews.llvm.org/D107963

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D107963: [OpenCL] Fix as_type(vec3) invalid store creation

2021-08-17 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added inline comments.



Comment at: clang/lib/CodeGen/CGExprScalar.cpp:4789
-
-if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
-  Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,

svenvh wrote:
> Anastasia wrote:
> > While I agree with this fix and it obviously looks incorrect, I wonder if 
> > the original intent was to condition the previous statement instead so that 
> > we avoid converting to size 4 at all? Although I have a feeling we are 
> > entering the behavior that is not documented anywhere. In the spec I can 
> > see this:
> > 
> > 
> > ```
> > When the operand and result type contain a different number of elements, 
> > the result shall be implementation-defined except if the operand is a 
> > 4-component vector and the result is a 3-component vector. In this case, 
> > the bits in the operand shall be returned directly without modification as 
> > the new type. 
> > ```
> > 
> > but it seems to cover the inverse conversion?
> Yeah I have a similar fix for the inverse case (which is further down in this 
> function) in my local branch.
> 
> I did try to extend the guard to also cover the `ConvertVec3AndVec4` call, 
> but that also led to invalid StoreInst creation.  Since I wasn't sure about 
> the intent of the conditioning on `PreserveVec3Type` here, I didn't 
> investigate further.
> 
> I was hoping @jaykang10 (who added this in D30810) might have some insight 
> into why the guard was here in the first place.  But it has been over 4 years 
> since that was committed, so there might not be a ready answer.  Either way, 
> I'll hold off committing this for a few more days.
I am sorry for the late response. I have not been feeling well.

As far as I remember, the goal was to avoid the bitcast and keep loads and 
stores with the vec3 type at the IR level. I guess I did not consider the 
conversion from vec3 type to scalar type, or vice versa.

I guess this guard was there to avoid the bitcast. It could be wrong for scalar 
types. If you check for scalar types in the guard, it could be good to keep the 
existing behavior for vector types.

Additionally, you might also want to change the code below for the conversion 
from non-vec3 to vec3.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107963/new/

https://reviews.llvm.org/D107963

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D98269: [AArch64] Add missing intrinsics for scalar fp rounding

2021-03-10 Thread JinGu Kang via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG25951c5ab8e9: [AArch64] Add missing intrinsics for scalar FP 
rounding (authored by jaykang10).
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98269/new/

https://reviews.llvm.org/D98269

Files:
  clang/include/clang/Basic/BuiltinsAArch64.def
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Headers/arm_acle.h
  clang/test/CodeGen/aarch64-v8.5a-scalar-frint3264-intrinsic.c
  llvm/include/llvm/IR/IntrinsicsAArch64.td
  llvm/lib/Target/AArch64/AArch64InstrInfo.td
  llvm/test/CodeGen/AArch64/v8.5a-scalar-frint3264-intrinsic.ll

Index: llvm/test/CodeGen/AArch64/v8.5a-scalar-frint3264-intrinsic.ll
===
--- /dev/null
+++ llvm/test/CodeGen/AArch64/v8.5a-scalar-frint3264-intrinsic.ll
@@ -0,0 +1,83 @@
+; RUN: llc < %s -mtriple=aarch64-eabi -mattr=+v8.5a  | FileCheck %s
+
+declare float @llvm.aarch64.frint32z.f32(float)
+declare double @llvm.aarch64.frint32z.f64(double)
+declare float @llvm.aarch64.frint64z.f32(float)
+declare double @llvm.aarch64.frint64z.f64(double)
+
+define dso_local float @t_frint32z(float %a) {
+; CHECK-LABEL: t_frint32z:
+; CHECK: frint32z s0, s0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call float @llvm.aarch64.frint32z.f32(float %a)
+  ret float %val
+}
+
+define dso_local double @t_frint32zf(double %a) {
+; CHECK-LABEL: t_frint32zf:
+; CHECK: frint32z d0, d0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call double @llvm.aarch64.frint32z.f64(double %a)
+  ret double %val
+}
+
+define dso_local float @t_frint64z(float %a) {
+; CHECK-LABEL: t_frint64z:
+; CHECK: frint64z s0, s0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call float @llvm.aarch64.frint64z.f32(float %a)
+  ret float %val
+}
+
+define dso_local double @t_frint64zf(double %a) {
+; CHECK-LABEL: t_frint64zf:
+; CHECK: frint64z d0, d0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call double @llvm.aarch64.frint64z.f64(double %a)
+  ret double %val
+}
+
+declare float @llvm.aarch64.frint32x.f32(float)
+declare double @llvm.aarch64.frint32x.f64(double)
+declare float @llvm.aarch64.frint64x.f32(float)
+declare double @llvm.aarch64.frint64x.f64(double)
+
+define dso_local float @t_frint32x(float %a) {
+; CHECK-LABEL: t_frint32x:
+; CHECK: frint32x s0, s0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call float @llvm.aarch64.frint32x.f32(float %a)
+  ret float %val
+}
+
+define dso_local double @t_frint32xf(double %a) {
+; CHECK-LABEL: t_frint32xf:
+; CHECK: frint32x d0, d0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call double @llvm.aarch64.frint32x.f64(double %a)
+  ret double %val
+}
+
+define dso_local float @t_frint64x(float %a) {
+; CHECK-LABEL: t_frint64x:
+; CHECK: frint64x s0, s0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call float @llvm.aarch64.frint64x.f32(float %a)
+  ret float %val
+}
+
+define dso_local double @t_frint64xf(double %a) {
+; CHECK-LABEL: t_frint64xf:
+; CHECK: frint64x d0, d0
+; CHECK-NEXT:ret
+entry:
+  %val = tail call double @llvm.aarch64.frint64x.f64(double %a)
+  ret double %val
+}
Index: llvm/lib/Target/AArch64/AArch64InstrInfo.td
===
--- llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -3805,10 +3805,10 @@
 }
 
 let Predicates = [HasFRInt3264] in {
-  defm FRINT32Z : FRIntNNT<0b00, "frint32z">;
-  defm FRINT64Z : FRIntNNT<0b10, "frint64z">;
-  defm FRINT32X : FRIntNNT<0b01, "frint32x">;
-  defm FRINT64X : FRIntNNT<0b11, "frint64x">;
+  defm FRINT32Z : FRIntNNT<0b00, "frint32z", int_aarch64_frint32z>;
+  defm FRINT64Z : FRIntNNT<0b10, "frint64z", int_aarch64_frint64z>;
+  defm FRINT32X : FRIntNNT<0b01, "frint32x", int_aarch64_frint32x>;
+  defm FRINT64X : FRIntNNT<0b11, "frint64x", int_aarch64_frint64x>;
 } // HasFRInt3264
 
 let Predicates = [HasFullFP16] in {
Index: llvm/include/llvm/IR/IntrinsicsAArch64.td
===
--- llvm/include/llvm/IR/IntrinsicsAArch64.td
+++ llvm/include/llvm/IR/IntrinsicsAArch64.td
@@ -44,6 +44,19 @@
 def int_aarch64_cls: DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrNoMem]>;
 def int_aarch64_cls64: DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;
 
+def int_aarch64_frint32z
+: DefaultAttrsIntrinsic<[ llvm_anyfloat_ty ], [ LLVMMatchType<0> ],
+[ IntrNoMem ]>;
+def int_aarch64_frint64z
+: DefaultAttrsIntrinsic<[ llvm_anyfloat_ty ], [ LLVMMatchType<0> ],
+[ IntrNoMem ]>;
+def int_aarch64_frint32x
+: DefaultAttrsIntrinsic<[ llvm_anyfloat_ty ], [ LLVMMatchType<0> ],
+[ IntrNoMem ]>;
+def 

[PATCH] D97775: [AArch64] Add missing intrinsics for vcls

2021-03-03 Thread JinGu Kang via Phabricator via cfe-commits
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
Closed by commit rG394a4d04333e: [AArch64] Add missing intrinsics for vcls 
(authored by jaykang10).
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D97775/new/

https://reviews.llvm.org/D97775

Files:
  clang/include/clang/Basic/arm_neon.td
  clang/test/CodeGen/aarch64-neon-misc.c
  clang/test/CodeGen/arm_neon_intrinsics.c

Index: clang/test/CodeGen/arm_neon_intrinsics.c
===
--- clang/test/CodeGen/arm_neon_intrinsics.c
+++ clang/test/CodeGen/arm_neon_intrinsics.c
@@ -1776,6 +1776,31 @@
   return vcls_s32(a);
 }
 
+// CHECK-LABEL: @test_vcls_u8(
+// CHECK:   [[VCLS_V_I:%.*]] = call <8 x i8> @llvm.arm.neon.vcls.v8i8(<8 x i8> %a)
+// CHECK:   ret <8 x i8> [[VCLS_V_I]]
+int8x8_t test_vcls_u8(uint8x8_t a) {
+  return vcls_u8(a);
+}
+
+// CHECK-LABEL: @test_vcls_u16(
+// CHECK:   [[TMP0:%.*]] = bitcast <4 x i16> %a to <8 x i8>
+// CHECK:   [[VCLS_V1_I:%.*]] = call <4 x i16> @llvm.arm.neon.vcls.v4i16(<4 x i16> %a)
+// CHECK:   [[VCLS_V2_I:%.*]] = bitcast <4 x i16> [[VCLS_V1_I]] to <8 x i8>
+// CHECK:   ret <4 x i16> [[VCLS_V1_I]]
+int16x4_t test_vcls_u16(uint16x4_t a) {
+  return vcls_u16(a);
+}
+
+// CHECK-LABEL: @test_vcls_u32(
+// CHECK:   [[TMP0:%.*]] = bitcast <2 x i32> %a to <8 x i8>
+// CHECK:   [[VCLS_V1_I:%.*]] = call <2 x i32> @llvm.arm.neon.vcls.v2i32(<2 x i32> %a)
+// CHECK:   [[VCLS_V2_I:%.*]] = bitcast <2 x i32> [[VCLS_V1_I]] to <8 x i8>
+// CHECK:   ret <2 x i32> [[VCLS_V1_I]]
+int32x2_t test_vcls_u32(uint32x2_t a) {
+  return vcls_u32(a);
+}
+
 // CHECK-LABEL: @test_vclsq_s8(
 // CHECK:   [[VCLSQ_V_I:%.*]] = call <16 x i8> @llvm.arm.neon.vcls.v16i8(<16 x i8> %a)
 // CHECK:   ret <16 x i8> [[VCLSQ_V_I]]
@@ -1801,6 +1826,31 @@
   return vclsq_s32(a);
 }
 
+// CHECK-LABEL: @test_vclsq_u8(
+// CHECK:   [[VCLSQ_V_I:%.*]] = call <16 x i8> @llvm.arm.neon.vcls.v16i8(<16 x i8> %a)
+// CHECK:   ret <16 x i8> [[VCLSQ_V_I]]
+int8x16_t test_vclsq_u8(uint8x16_t a) {
+  return vclsq_u8(a);
+}
+
+// CHECK-LABEL: @test_vclsq_u16(
+// CHECK:   [[TMP0:%.*]] = bitcast <8 x i16> %a to <16 x i8>
+// CHECK:   [[VCLSQ_V1_I:%.*]] = call <8 x i16> @llvm.arm.neon.vcls.v8i16(<8 x i16> %a)
+// CHECK:   [[VCLSQ_V2_I:%.*]] = bitcast <8 x i16> [[VCLSQ_V1_I]] to <16 x i8>
+// CHECK:   ret <8 x i16> [[VCLSQ_V1_I]]
+int16x8_t test_vclsq_u16(uint16x8_t a) {
+  return vclsq_u16(a);
+}
+
+// CHECK-LABEL: @test_vclsq_u32(
+// CHECK:   [[TMP0:%.*]] = bitcast <4 x i32> %a to <16 x i8>
+// CHECK:   [[VCLSQ_V1_I:%.*]] = call <4 x i32> @llvm.arm.neon.vcls.v4i32(<4 x i32> %a)
+// CHECK:   [[VCLSQ_V2_I:%.*]] = bitcast <4 x i32> [[VCLSQ_V1_I]] to <16 x i8>
+// CHECK:   ret <4 x i32> [[VCLSQ_V1_I]]
+int32x4_t test_vclsq_u32(uint32x4_t a) {
+  return vclsq_u32(a);
+}
+
 // CHECK-LABEL: @test_vclt_s8(
 // CHECK:   [[CMP_I:%.*]] = icmp slt <8 x i8> %a, %b
 // CHECK:   [[SEXT_I:%.*]] = sext <8 x i1> [[CMP_I]] to <8 x i8>
Index: clang/test/CodeGen/aarch64-neon-misc.c
===
--- clang/test/CodeGen/aarch64-neon-misc.c
+++ clang/test/CodeGen/aarch64-neon-misc.c
@@ -1432,6 +1432,13 @@
   return vcls_s8(a);
 }
 
+// CHECK-LABEL: @test_vcls_u8(
+// CHECK:   [[VCLS_V_I:%.*]] = call <8 x i8> @llvm.aarch64.neon.cls.v8i8(<8 x i8> %a)
+// CHECK:   ret <8 x i8> [[VCLS_V_I]]
+int8x8_t test_vcls_u8(uint8x8_t a) {
+  return vcls_u8(a);
+}
+
 // CHECK-LABEL: @test_vclsq_s8(
 // CHECK:   [[VCLSQ_V_I:%.*]] = call <16 x i8> @llvm.aarch64.neon.cls.v16i8(<16 x i8> %a)
 // CHECK:   ret <16 x i8> [[VCLSQ_V_I]]
@@ -1439,6 +1446,13 @@
   return vclsq_s8(a);
 }
 
+// CHECK-LABEL: @test_vclsq_u8(
+// CHECK:   [[VCLSQ_V_I:%.*]] = call <16 x i8> @llvm.aarch64.neon.cls.v16i8(<16 x i8> %a)
+// CHECK:   ret <16 x i8> [[VCLSQ_V_I]]
+int8x16_t test_vclsq_u8(uint8x16_t a) {
+  return vclsq_u8(a);
+}
+
 // CHECK-LABEL: @test_vcls_s16(
 // CHECK:   [[TMP0:%.*]] = bitcast <4 x i16> %a to <8 x i8>
 // CHECK:   [[VCLS_V1_I:%.*]] = call <4 x i16> @llvm.aarch64.neon.cls.v4i16(<4 x i16> %a)
@@ -1448,6 +1462,15 @@
   return vcls_s16(a);
 }
 
+// CHECK-LABEL: @test_vcls_u16(
+// CHECK:   [[TMP0:%.*]] = bitcast <4 x i16> %a to <8 x i8>
+// CHECK:   [[VCLS_V1_I:%.*]] = call <4 x i16> @llvm.aarch64.neon.cls.v4i16(<4 x i16> %a)
+// CHECK:   [[VCLS_V2_I:%.*]] = bitcast <4 x i16> [[VCLS_V1_I]] to <8 x i8>
+// CHECK:   ret <4 x i16> [[VCLS_V1_I]]
+int16x4_t test_vcls_u16(uint16x4_t a) {
+  return vcls_u16(a);
+}
+
 // CHECK-LABEL: @test_vclsq_s16(
 // CHECK:   [[TMP0:%.*]] = bitcast <8 x i16> %a to <16 x i8>
 // CHECK:   [[VCLSQ_V1_I:%.*]] = call <8 x i16> @llvm.aarch64.neon.cls.v8i16(<8 x i16> %a)
@@ -1457,6 +1480,15 @@
   return vclsq_s16(a);
 }
 
+// CHECK-LABEL: @test_vclsq_u16(
+// CHECK:   [[TMP0:%.*]] = bitcast <8 x i16> %a to 

[PATCH] D62962: Clang implementation of sizeless types

2019-06-07 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added inline comments.



Comment at: test/Sema/sizeless-1.c:29
+
+typedef svint8_t vec_int8 __attribute__((vector_size(64))); // expected-error {{invalid vector element type}}
+

Please check "__attribute__((ext_vector_type()))". I guess it is also invalid 
for this type.



Comment at: test/SemaCXX/sizeless-1.cpp:37
+
+typedef svint8_t vec_int8 __attribute__((vector_size(64))); // expected-error {{invalid vector element type}}
+

Please check "__attribute__((ext_vector_type()))". I guess it is also invalid 
for this type.
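
A sketch of the suggested check, assuming the diagnostic matches the
vector_size case (illustrative only, not the committed test):

  // A sizeless SVE type should also be rejected as the element type of an
  // ext_vector_type vector.
  typedef svint8_t ext_vec_int8 __attribute__((ext_vector_type(16))); // expected-error {{invalid vector element type}}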


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62962/new/

https://reviews.llvm.org/D62962



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-04-04 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#718176, @Anastasia wrote:

> LGTM! Thanks!


Thanks Anastasia! Merge Done!


Repository:
  rL LLVM

https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-04-04 Thread JinGu Kang via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rL299445: Preserve vec3 type. (authored by jaykang10).

Changed prior to commit:
  https://reviews.llvm.org/D30810?vs=92829&id=94088#toc

Repository:
  rL LLVM

https://reviews.llvm.org/D30810

Files:
  cfe/trunk/include/clang/Driver/CC1Options.td
  cfe/trunk/include/clang/Frontend/CodeGenOptions.def
  cfe/trunk/lib/CodeGen/CGExpr.cpp
  cfe/trunk/lib/CodeGen/CGExprScalar.cpp
  cfe/trunk/lib/Frontend/CompilerInvocation.cpp
  cfe/trunk/test/CodeGenOpenCL/preserve_vec3.cl

Index: cfe/trunk/include/clang/Frontend/CodeGenOptions.def
===
--- cfe/trunk/include/clang/Frontend/CodeGenOptions.def
+++ cfe/trunk/include/clang/Frontend/CodeGenOptions.def
@@ -263,6 +263,9 @@
 /// Whether emit extra debug info for sample pgo profile collection.
 CODEGENOPT(DebugInfoForProfiling, 1, 0)
 
+/// Whether 3-component vector type is preserved.
+CODEGENOPT(PreserveVec3Type, 1, 0)
+
 #undef CODEGENOPT
 #undef ENUM_CODEGENOPT
 #undef VALUE_CODEGENOPT
Index: cfe/trunk/include/clang/Driver/CC1Options.td
===
--- cfe/trunk/include/clang/Driver/CC1Options.td
+++ cfe/trunk/include/clang/Driver/CC1Options.td
@@ -658,6 +658,8 @@
   HelpText<"Set default MS calling convention">;
 def finclude_default_header : Flag<["-"], "finclude-default-header">,
   HelpText<"Include the default header file for OpenCL">;
+def fpreserve_vec3_type : Flag<["-"], "fpreserve-vec3-type">,
+  HelpText<"Preserve 3-component vector type">;
 
 // FIXME: Remove these entirely once functionality/tests have been excised.
 def fobjc_gc_only : Flag<["-"], "fobjc-gc-only">, Group,
Index: cfe/trunk/test/CodeGenOpenCL/preserve_vec3.cl
===
--- cfe/trunk/test/CodeGenOpenCL/preserve_vec3.cl
+++ cfe/trunk/test/CodeGenOpenCL/preserve_vec3.cl
@@ -0,0 +1,24 @@
+// RUN: %clang_cc1 %s -emit-llvm -o - -triple spir-unknown-unknown -fpreserve-vec3-type  | FileCheck %s
+
+typedef float float3 __attribute__((ext_vector_type(3)));
+typedef float float4 __attribute__((ext_vector_type(4)));
+
+void kernel foo(global float3 *a, global float3 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a
+  // CHECK: store <3 x float> %[[LOAD_A]], <3 x float> addrspace(1)* %b
+  *b = *a;
+}
+
+void kernel float4_to_float3(global float3 *a, global float4 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <4 x float>, <4 x float> addrspace(1)* %b, align 16
+  // CHECK: %[[ASTYPE:.*]] = shufflevector <4 x float> %[[LOAD_A]], <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>
+  // CHECK: store <3 x float> %[[ASTYPE:.*]], <3 x float> addrspace(1)* %a, align 16
+  *a = __builtin_astype(*b, float3);
+}
+
+void kernel float3_to_float4(global float3 *a, global float4 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a, align 16
+  // CHECK: %[[ASTYPE:.*]] = shufflevector <3 x float> %[[LOAD_A]], <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
+  // CHECK: store <4 x float> %[[ASTYPE:.*]], <4 x float> addrspace(1)* %b, align 16
+  *b = __builtin_astype(*a, float4);
+}
Index: cfe/trunk/lib/Frontend/CompilerInvocation.cpp
===
--- cfe/trunk/lib/Frontend/CompilerInvocation.cpp
+++ cfe/trunk/lib/Frontend/CompilerInvocation.cpp
@@ -729,6 +729,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_fpreserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: cfe/trunk/lib/CodeGen/CGExprScalar.cpp
===
--- cfe/trunk/lib/CodeGen/CGExprScalar.cpp
+++ cfe/trunk/lib/CodeGen/CGExprScalar.cpp
@@ -3593,19 +3593,26 @@
   // vector to get a vec4, then a bitcast if the target type is different.
   if (NumElementsSrc == 3 && NumElementsDst != 3) {
 Src = ConvertVec3AndVec4(Builder, CGF, Src, 4);
-Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
-   DstTy);
+
+if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
+  Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
+ DstTy);
+}
+
 Src->setName("astype");
 return Src;
   }
 
   // Going from non-vec3 to vec3 is a special case and requires a bitcast
   // to vec4 if the original type is not vec4, then a shuffle vector to
   // get a vec3.
   if (NumElementsSrc != 3 && NumElementsDst == 3) {
-auto Vec4Ty = llvm::VectorType::get(DstTy->getVectorElementType(), 4);
-Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
-   Vec4Ty);
+if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
+  auto Vec4Ty = 

[PATCH] D30810: Preserve vec3 type.

2017-03-24 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

> I believe the argument lacks numbers (or at least you have them but didn't 
> mention). I didn't hear about performance results, or validation that this 
> was actually tested for correctness. Small test cases prove a point, but 
> can't be considered general.
> 
> OTOH, it seems like this is exactly why you want the flag, to hide any 
> potential issues and play with it. I'm not opposed to adding the flag if 
> there's a commitment to actually get the result and change the default to 
> whatever seems to be better, does that seems reasonable?

To be honest, I did not start this discussion to get better performance for 
GPUs, but the goal has shifted while we discussed it. I think I have shown you 
the gain on the AMD GPU target in previous comments. The current version of 
clang/llvm generates one less load/store when preserving the vec3 type instead 
of the vec4 type on that GPU target. But it could cause more loads/stores on 
other targets that have vector registers, because they usually expect aligned 
vector loads/stores. It means that if we want to make vec3 the default in 
clang, we need to change LLVM codegen's default legalization, or backend 
targets need to be modified to get aligned vector loads/stores with the vec3 
type. I think that is too big a change at this moment. In the future, we could 
move this discussion to llvm-dev. Until then, as you mentioned, I would like to 
hide the potential issues with the vec3 type behind the flag and experiment 
with it.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-23 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 updated this revision to Diff 92829.
jaykang10 added a comment.

Preserved vec3 type on __builtin_astype.


https://reviews.llvm.org/D30810

Files:
  include/clang/Driver/CC1Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/CodeGen/CGExprScalar.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenOpenCL/preserve_vec3.cl

Index: test/CodeGenOpenCL/preserve_vec3.cl
===
--- /dev/null
+++ test/CodeGenOpenCL/preserve_vec3.cl
@@ -0,0 +1,24 @@
+// RUN: %clang_cc1 %s -emit-llvm -o - -triple spir-unknown-unknown -fpreserve-vec3-type  | FileCheck %s
+
+typedef float float3 __attribute__((ext_vector_type(3)));
+typedef float float4 __attribute__((ext_vector_type(4)));
+
+void kernel foo(global float3 *a, global float3 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a
+  // CHECK: store <3 x float> %[[LOAD_A]], <3 x float> addrspace(1)* %b
+  *b = *a;
+}
+
+void kernel float4_to_float3(global float3 *a, global float4 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <4 x float>, <4 x float> addrspace(1)* %b, align 16
+  // CHECK: %[[ASTYPE:.*]] = shufflevector <4 x float> %[[LOAD_A]], <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>
+  // CHECK: store <3 x float> %[[ASTYPE:.*]], <3 x float> addrspace(1)* %a, align 16
+  *a = __builtin_astype(*b, float3);
+}
+
+void kernel float3_to_float4(global float3 *a, global float4 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a, align 16
+  // CHECK: %[[ASTYPE:.*]] = shufflevector <3 x float> %[[LOAD_A]], <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
+  // CHECK: store <4 x float> %[[ASTYPE:.*]], <4 x float> addrspace(1)* %b, align 16
+  *b = __builtin_astype(*a, float4);
+}
Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -713,6 +713,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_fpreserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: lib/CodeGen/CGExprScalar.cpp
===
--- lib/CodeGen/CGExprScalar.cpp
+++ lib/CodeGen/CGExprScalar.cpp
@@ -3589,19 +3589,26 @@
   // vector to get a vec4, then a bitcast if the target type is different.
   if (NumElementsSrc == 3 && NumElementsDst != 3) {
 Src = ConvertVec3AndVec4(Builder, CGF, Src, 4);
-Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
-   DstTy);
+
+if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
+  Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
+ DstTy);
+}
+
 Src->setName("astype");
 return Src;
   }
 
   // Going from non-vec3 to vec3 is a special case and requires a bitcast
   // to vec4 if the original type is not vec4, then a shuffle vector to
   // get a vec3.
   if (NumElementsSrc != 3 && NumElementsDst == 3) {
-auto Vec4Ty = llvm::VectorType::get(DstTy->getVectorElementType(), 4);
-Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
-   Vec4Ty);
+if (!CGF.CGM.getCodeGenOpts().PreserveVec3Type) {
+  auto Vec4Ty = llvm::VectorType::get(DstTy->getVectorElementType(), 4);
+  Src = createCastsForTypeOfSameSize(Builder, CGF.CGM.getDataLayout(), Src,
+ Vec4Ty);
+}
+
 Src = ConvertVec3AndVec4(Builder, CGF, Src, 3);
 Src->setName("astype");
 return Src;
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1369,26 +1369,28 @@
QualType TBAABaseType,
uint64_t TBAAOffset,
bool isNontemporal) {
-  // For better performance, handle vector loads differently.
-  if (Ty->isVectorType()) {
-const llvm::Type *EltTy = Addr.getElementType();
-
-const auto *VTy = cast(EltTy);
-
-// Handle vectors of size 3 like size 4 for better performance.
-if (VTy->getNumElements() == 3) {
-
-  // Bitcast to vec4 type.
-  llvm::VectorType *vec4Ty = llvm::VectorType::get(VTy->getElementType(),
- 4);
-  Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
-  // Now load value.
-  llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
-
-  // Shuffle vector to get vec3.
-  V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
-  {0, 1, 2}, "extractVec");
-  

[PATCH] D30810: Preserve vec3 type.

2017-03-22 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

> Yes. This would make sense. I am guessing that in vec3->vec4, we will have 3 
> loads and 4 stores and in vec4->vec3 we will have 4 loads and 3 stores?

It depends on the implementation. If you scalarize all vector operations at the 
LLVM IR level before entering LLVM's codegen, vec3->vec4 would generate 3 loads 
and 4 stores, and vec4->vec3 would generate 4 loads and 3 stores. I guess your 
implementation follows this model. The AMDGPU target keeps the vector form at 
the LLVM IR level and handles it during SelectionDAG legalization in LLVM's 
codegen. In that situation, vec3->vec4 generates 4 loads and 4 stores, because 
the type legalizer widens the vec3 load to a vec4 load given the 16-byte 
alignment; vec4->vec3 generates 4 loads and 3 stores under type legalization. 
The output can differ according to the backend implementation.

> Although, I am guessing the load/store optimizations would come from your 
> current change? I don't see anything related to it in the VisitAsTypeExpr 
> implementation. But I think it might be a good idea to add this to the test 
> to make sure the IR output is as we expect.

We should emit those IRs under the option. I will update the patch with it.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-21 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#706677, @Anastasia wrote:

> In https://reviews.llvm.org/D30810#706676, @Anastasia wrote:
>
> > In https://reviews.llvm.org/D30810#699428, @Anastasia wrote:
> >
> > > Would you be able to update ScalarExprEmitter::VisitAsTypeExpr 
> > > implementation accordingly to make things consistent?
> >
> >
> > Not sure it got lost somewhere... do you think you could address this too?
>
>
> I guess due to 4 element alignment the generated casts as vec4 should be fine 
> for vec3, but it is worth checking...


Let's look at examples.

Source code

  void kernel float4_to_float3(global float3 *a, global float4 *b) {
//float4 --> float3
*a = __builtin_astype(*b, float3);
  }
  
  void kernel float3_to_float4(global float3 *a, global float4 *b) {
//float3 --> float4
*b = __builtin_astype(*a, float4);
  }

LLVM IR

  ; Function Attrs: norecurse nounwind
  define void @float4_to_float3(<3 x float>* nocapture %a, <4 x float>* 
nocapture readonly %b) {
  entry:
%0 = load <4 x float>, <4 x float>* %b, align 16, !tbaa !6
%storetmp = bitcast <3 x float>* %a to <4 x float>*
store <4 x float> %0, <4 x float>* %storetmp, align 16, !tbaa !6
ret void
  }
  
  ; Function Attrs: norecurse nounwind
  define void @float3_to_float4(<3 x float>* nocapture readonly %a, <4 x 
float>* nocapture %b) {
  entry:
%castToVec4 = bitcast <3 x float>* %a to <4 x float>*
%loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
store <4 x float> %loadVec4, <4 x float>* %b, align 16, !tbaa !6
ret void
  }

We could rewrite the above IR with vec3 as follows:

  ; Function Attrs: norecurse nounwind
  define void @float4_to_float3(<3 x float>* nocapture %a, <4 x float>* nocapture readonly %b) {
  entry:
    %0 = load <4 x float>, <4 x float>* %b, align 16
    %1 = shufflevector <4 x float> %0, <4 x float> undef, <3 x i32> <i32 0, i32 1, i32 2>
    store <3 x float> %1, <3 x float>* %a, align 16
    ret void
  }
  
  ; Function Attrs: norecurse nounwind
  define void @float3_to_float4(<3 x float>* nocapture readonly %a, <4 x float>* nocapture %b) {
  entry:
    %0 = load <3 x float>, <3 x float>* %a, align 16
    %1 = shufflevector <3 x float> %0, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>
    store <4 x float> %1, <4 x float>* %b, align 16
    ret void
  }

I guess this is what you want at the LLVM IR level. I have compiled the above 
IRs with the amdgcn target.

vec4 float4_to_float3 assembly code

  float4_to_float3:   ; @float4_to_float3
  ; BB#0: ; %entry
  s_load_dword s2, s[0:1], 0x9 
  s_load_dword s0, s[0:1], 0xa 
  s_mov_b32 s4, SCRATCH_RSRC_DWORD0
  s_mov_b32 s5, SCRATCH_RSRC_DWORD1
  s_mov_b32 s6, -1
  s_mov_b32 s8, s3
  s_mov_b32 s7, 0xe8f000
  s_waitcnt lgkmcnt(0)
  v_mov_b32_e32 v0, s0
  buffer_load_dword v2, v0, s[4:7], s8 offen
  buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
  buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
  buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
  v_mov_b32_e32 v1, s2
  s_waitcnt vmcnt(0)
  buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
  buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
  buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
  buffer_store_dword v2, v1, s[4:7], s8 offen
  s_endpgm

vec3 float4_to_float3 assembly code

  float4_to_float3:   ; @float4_to_float3
  ; BB#0: ; %entry
  s_load_dword s2, s[0:1], 0x9 
  s_load_dword s0, s[0:1], 0xa 
  s_mov_b32 s4, SCRATCH_RSRC_DWORD0
  s_mov_b32 s5, SCRATCH_RSRC_DWORD1
  s_mov_b32 s6, -1
  s_mov_b32 s8, s3
  s_mov_b32 s7, 0xe8f000
  s_waitcnt lgkmcnt(0)
  v_mov_b32_e32 v0, s0
  buffer_load_dword v2, v0, s[4:7], s8 offen
  buffer_load_dword v3, v0, s[4:7], s8 offen offset:8
  buffer_load_dword v0, v0, s[4:7], s8 offen offset:4
  v_mov_b32_e32 v1, s2
  s_waitcnt vmcnt(0)
  buffer_store_dword v0, v1, s[4:7], s8 offen offset:4
  buffer_store_dword v2, v1, s[4:7], s8 offen
  buffer_store_dword v3, v1, s[4:7], s8 offen offset:8
  s_endpgm

I think this is what we expect, because there are 3 loads/stores for the vec3 
version instead of 4.

The float3_to_float4 output is different from float4_to_float3.

vec4 float3_to_float4 assembly code

  float3_to_float4:   ; @float3_to_float4
  ; BB#0: ; %entry
  s_load_dword s2, s[0:1], 0x9
  s_mov_b32 s4, SCRATCH_RSRC_DWORD0
  s_mov_b32 s5, SCRATCH_RSRC_DWORD1
  s_mov_b32 s6, -1
  s_mov_b32 s8, s3
  s_mov_b32 s7, 0xe8f000
  s_waitcnt lgkmcnt(0)
  v_mov_b32_e32 v0, s2
  buffer_load_dword v2, v0, s[4:7], s8 offen
  

[PATCH] D30810: Preserve vec3 type.

2017-03-21 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#702637, @jaykang10 wrote:

> In https://reviews.llvm.org/D30810#702614, @Anastasia wrote:
>
> > In https://reviews.llvm.org/D30810#702443, @bruno wrote:
> >
> > > > As a result, I think it would be good for clang to have both of 
> > > > features and I would like to stick to the option "-fpresereve-vec3' to 
> > > > change the behavior easily.
> > >
> > > The motivation doesn't seem solid to me, who else is going to benefit 
> > > from this flag?
> >
> >
> > There are some off the main tree implementation that would benefit. But in 
> > the case of AMD GPU 3 loads/stores will be produced instead of 4. Sounds 
> > like a good optimization to me. As I said in my previous comment I think it 
> > should have been the default behavior from the beginning, but since 
> > different implementation landed first we can integrate this one now with an 
> > additional option.
>
>
> Additionally, here is the assembly output from vec3 with the amdgcn target. :)
>
> LLVM IR
>
>   define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly 
> %b) {
>   entry:
> %0 = load <3 x float>, <3 x float>* %b, align 16
> store <3 x float> %0, <3 x float>* %a, align 16
> ret void
>   }
>
>
> Assembly Output
>
>   .text
>   .section.AMDGPU.config
>   .long   47176
>   .long   11272256
>   .long   47180
>   .long   132
>   .long   47200
>   .long   0
>   .long   4
>   .long   0
>   .long   8
>   .long   0
>   .text
>   .globl  foo
>   .p2align8
>   .type   foo,@function
>   foo:; @foo
>   ; BB#0: ; %entry
>   s_load_dword s2, s[0:1], 0x9
>   s_load_dword s0, s[0:1], 0xa
>   s_mov_b32 s4, SCRATCH_RSRC_DWORD0
>   s_mov_b32 s5, SCRATCH_RSRC_DWORD1
>   s_mov_b32 s6, -1
>   s_mov_b32 s8, s3
>   s_mov_b32 s7, 0xe8f000
>   s_waitcnt lgkmcnt(0)
>   v_mov_b32_e32 v0, s0
>   buffer_load_dword v2, v0, s[4:7], s8 offen
>   buffer_load_dword v3, v0, s[4:7], s8 offen offset:8
>   buffer_load_dword v0, v0, s[4:7], s8 offen offset:4
>   v_mov_b32_e32 v1, s2
>   s_waitcnt vmcnt(0)
>   buffer_store_dword v0, v1, s[4:7], s8 offen offset:4
>   buffer_store_dword v2, v1, s[4:7], s8 offen
>   buffer_store_dword v3, v1, s[4:7], s8 offen offset:8
>   s_endpgm
>   .Lfunc_end0:
>   .size   foo, .Lfunc_end0-foo
>  
>   .section.AMDGPU.csdata
>   ; Kernel info:
>   ; codeLenInByte = 112
>   ; NumSgprs: 9
>   ; NumVgprs: 4
>   ; FloatMode: 192
>   ; IeeeMode: 1
>   ; ScratchSize: 0
>   ; LDSByteSize: 0 bytes/workgroup (compile time only)
>   ; SGPRBlocks: 1
>   ; VGPRBlocks: 0
>   ; NumSGPRsForWavesPerEU: 9
>   ; NumVGPRsForWavesPerEU: 4
>   ; ReservedVGPRFirst: 0
>   ; ReservedVGPRCount: 0
>   ; COMPUTE_PGM_RSRC2:USER_SGPR: 2
>   ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
>   ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
>   ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0
>   ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0
>   ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0
>  
>   .section".note.GNU-stack"
>


Hi Anastasia, Bruno,

Do you still have any other opinions, or can we go ahead and commit this patch?


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-16 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#702614, @Anastasia wrote:

> In https://reviews.llvm.org/D30810#702443, @bruno wrote:
>
> > > As a result, I think it would be good for clang to have both of features 
> > > and I would like to stick to the option "-fpresereve-vec3' to change the 
> > > behavior easily.
> >
> > The motivation doesn't seem solid to me, who else is going to benefit from 
> > this flag?
>
>
> There are some off the main tree implementation that would benefit. But in 
> the case of AMD GPU 3 loads/stores will be produced instead of 4. Sounds like 
> a good optimization to me. As I said in my previous comment I think it should 
> have been the default behavior from the beginning, but since different 
> implementation landed first we can integrate this one now with an additional 
> option.


Additionally, here is the assembly output from vec3 with the amdgcn target. :)

LLVM IR

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly 
%b) {
  entry:
%0 = load <3 x float>, <3 x float>* %b, align 16
store <3 x float> %0, <3 x float>* %a, align 16
ret void
  }

Assembly Output

.text
.section.AMDGPU.config
.long   47176
.long   11272256
.long   47180
.long   132
.long   47200
.long   0
.long   4
.long   0
.long   8
.long   0
.text
.globl  foo
.p2align8
.type   foo,@function
  foo:; @foo
  ; BB#0: ; %entry
s_load_dword s2, s[0:1], 0x9
s_load_dword s0, s[0:1], 0xa
s_mov_b32 s4, SCRATCH_RSRC_DWORD0
s_mov_b32 s5, SCRATCH_RSRC_DWORD1
s_mov_b32 s6, -1
s_mov_b32 s8, s3
s_mov_b32 s7, 0xe8f000
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v0, s0
buffer_load_dword v2, v0, s[4:7], s8 offen
buffer_load_dword v3, v0, s[4:7], s8 offen offset:8
buffer_load_dword v0, v0, s[4:7], s8 offen offset:4
v_mov_b32_e32 v1, s2
s_waitcnt vmcnt(0)
buffer_store_dword v0, v1, s[4:7], s8 offen offset:4
buffer_store_dword v2, v1, s[4:7], s8 offen
buffer_store_dword v3, v1, s[4:7], s8 offen offset:8
s_endpgm
  .Lfunc_end0:
.size   foo, .Lfunc_end0-foo
  
.section.AMDGPU.csdata
  ; Kernel info:
  ; codeLenInByte = 112
  ; NumSgprs: 9
  ; NumVgprs: 4
  ; FloatMode: 192
  ; IeeeMode: 1
  ; ScratchSize: 0
  ; LDSByteSize: 0 bytes/workgroup (compile time only)
  ; SGPRBlocks: 1
  ; VGPRBlocks: 0
  ; NumSGPRsForWavesPerEU: 9
  ; NumVGPRsForWavesPerEU: 4
  ; ReservedVGPRFirst: 0
  ; ReservedVGPRCount: 0
  ; COMPUTE_PGM_RSRC2:USER_SGPR: 2
  ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0
  ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1
  ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0
  ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0
  ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0
  
.section".note.GNU-stack"


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-16 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#702548, @ahatanak wrote:

> In https://reviews.llvm.org/D30810#701141, @jaykang10 wrote:
>
> > In https://reviews.llvm.org/D30810#701132, @ahatanak wrote:
> >
> > > Actually, it's not a mis-compile. The record layout shows that there is a 
> > > padding before field f2 and f2 starts at byte 16. So using "store <4 x 
> > > float>" doesn't overwrite the field.
> >
> >
> > It depends on float3's alignment. I guess the float3's alignment is 16 on 
> > your target so there is a padding bytes before f2 to be aligned by 16. If 
> > you add __attribute__((packed)) on struct type, you could see overwrite.
>
>
> I looked at ItaniumRecordLayoutBuilder to understand how 
> __attribute__((packed)) affects the struct's layout.
>
> If the number of elements of a vector is not a power of 2, 
> ASTContext::getTypeInfoImpl rounds up the width and alignment to the next 
> power of 2. So the size of float3 is 16 bytes. __attribute__((packed)) can 
> change the alignment of float3, but it doesn't change its size, so there is a 
> 4-byte padding between field f1 and f2.


Oops, I am sorry, I was wrong; I did not realize the size behaves the same way 
as the alignment. Thank you for pointing it out! :) Additionally, I wonder why 
a non-power-of-2 length vector's size is rounded up like its alignment... Do 
you know the reason?
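
A quick sketch of the layout point above, with hypothetical struct names 
(compiled as C with clang's vector extensions):

  #include <stdio.h>
  #include <stddef.h>

  typedef float float3 __attribute__((ext_vector_type(3)));

  /* sizeof(float3) is rounded up to 16 (the next power of 2), so f2 starts at
     byte 16 whether or not the struct is packed: packing drops the alignment,
     not the size, and the 4-byte pad lives inside f1 itself. */
  struct S { float3 f1; float f2; };
  struct P { float3 f1; float f2; } __attribute__((packed));

  int main(void) {
    printf("%zu %zu %zu\n", sizeof(float3),
           offsetof(struct S, f2), offsetof(struct P, f2));
    return 0; /* expect: 16 16 16 */
  }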


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-15 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

> The motivation doesn't seem solid to me, who else is going to benefit from 
> this flag? You also didn't explain why doing this transformation yourself 
> (looking through the shuffle) on your downstream pass isn't enough for you. 
> We generally try to avoid adding flags if not for a good reason.

If you turn on optimization with the -O option, the shufflevector is removed by 
the InstructionCombining pass; I showed the IR example and source code in a 
previous comment. You could point to the bitcast instead of the shufflevector, 
but the bitcast can be a constant expression with global variables, which makes 
the code dirty. I think Anastasia has already explained the motivation well in 
her comments. If you need something more, please let me know.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-15 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#701861, @Anastasia wrote:

> I don't think there is anything wrong with the generation of vec3->vec4 in 
> Clang. I believe the motivation for this was the OpenCL spec treating vec3 as 
> vec4 aligned type (see section 6.1.5: 
> https://www.khronos.org/registry/OpenCL/specs/opencl-2.0-openclc.pdf#12). So 
> in terms of memory layout vec3 wouldn't be any different to vec4. But in 
> terms of operations (including loads/stores) there can be potential gain from 
> not handling the 4th element.  This can be exploited by some targets. I think 
> generating the vec3 from the frontend would be a better choice in the first 
> place. Because backend can decide how to handle this. Including for 
> architectures with no SIMD support it would just generate 3 separate 
> loads/stores. Right now it seems that it will be forced to generate 4 
> loads/stores.
>
> But considering that transformation to vec4 has been the default 
> implementation for quite a while in the frontend, I think we would need a 
> stronger motivation for switching to original vec3. So current approach with 
> a special flag for preserving vec3 should be good enough to fit all needs.


Thank you, Anastasia. Can we go ahead and commit this patch?


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-15 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.



> I believe the assumption is more practical: most part of upstream llvm 
> targets only support vectors with even sized number of lanes. And in those 
> cases you would have to expand to a 4x vector and leave the 4th element as 
> undef anyway, so it was done in the front-end to get rid of it right away. 
> Probably GPU targets do some special tricks here during legalization.

I have compiled the code below, which current clang generates for vec3, using 
llc with the amdgcn target.

LLVM IR (vec3 --> vec4)

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly 
%b) {
  entry:
%castToVec4 = bitcast <3 x float>* %b to <4 x float>*
%loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
%storetmp = bitcast <3 x float>* %a to <4 x float>*
store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
ret void
  }

SelectionDAG after legalization.

  Legalized selection DAG: BB#0 'foo:entry'
  SelectionDAG has 43 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
  t9: i64 = add t2, Constant:i64<40>
    t10: i32,ch = load t0, t9, undef:i64
      t4: i64 = add t2, Constant:i64<36>
    t6: i32,ch = load t0, t4, undef:i64
t12: ch = TokenFactor t6:1, t10:1
t32: v4i32 = BUILD_VECTOR t22, t25, t27, t29
  t21: v4f32 = bitcast t32
t18: v4i32 = bitcast t21
t22: i32,ch = load t12, t10, undef:i32
t24: i32 = add t10, Constant:i32<4>
t25: i32,ch = load t12, t24, undef:i32
t26: i32 = add t24, Constant:i32<4>
t27: i32,ch = load t12, t26, undef:i32
  t28: i32 = add t26, Constant:i32<4>
t29: i32,ch = load t12, t28, undef:i32
t31: ch = TokenFactor t22:1, t25:1, t27:1, t29:1
  t35: i32 = extract_vector_elt t18, Constant:i32<0>
t36: ch = store t31, t35, t6, undef:i32
  t38: i32 = extract_vector_elt t18, Constant:i32<1>
  t39: i32 = add t6, Constant:i32<4>
t40: ch = store t31, t38, t39, undef:i32
  t42: i32 = extract_vector_elt t18, Constant:i32<2>
  t44: i32 = add t6, Constant:i32<8>
t45: ch = store t31, t42, t44, undef:i32
  t47: i32 = extract_vector_elt t18, Constant:i32<3>
  t49: i32 = add t6, Constant:i32<12>
t50: ch = store t31, t47, t49, undef:i32
  t51: ch = TokenFactor t36, t40, t45, t50
t17: ch = ENDPGM t51

As you can see, the SelectionDAG still has 4 loads and 4 stores.

Assembly output

.section.AMDGPU.config
.long   47176
.long   11272257
.long   47180
.long   132
.long   47200
.long   0
.long   4
.long   0
.long   8
.long   0
.text
.globl  foo
.p2align8
.type   foo,@function
  foo:; @foo
  ; BB#0: ; %entry
s_load_dword s2, s[0:1], 0x9
s_load_dword s0, s[0:1], 0xa
s_mov_b32 s4, SCRATCH_RSRC_DWORD0
s_mov_b32 s5, SCRATCH_RSRC_DWORD1
s_mov_b32 s6, -1
s_mov_b32 s8, s3
s_mov_b32 s7, 0xe8f000
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v0, s0
buffer_load_dword v2, v0, s[4:7], s8 offen
buffer_load_dword v3, v0, s[4:7], s8 offen offset:4
buffer_load_dword v4, v0, s[4:7], s8 offen offset:8
buffer_load_dword v0, v0, s[4:7], s8 offen offset:12
v_mov_b32_e32 v1, s2
s_waitcnt vmcnt(0)
buffer_store_dword v0, v1, s[4:7], s8 offen offset:12
buffer_store_dword v4, v1, s[4:7], s8 offen offset:8
buffer_store_dword v3, v1, s[4:7], s8 offen offset:4
buffer_store_dword v2, v1, s[4:7], s8 offen
s_endpgm
  .Lfunc_end0:
.size   foo, .Lfunc_end0-foo
  
.section.AMDGPU.csdata

As you can see, the assembly also has 4 loads/stores. I think the GPU target 
does not handle it specially.

> My guess here is that targets that do care are looking through the vector 
> shuffle and customizing to whatever seems the best to them. If you wanna 
> change the default behavior you need some data to show that your model solves 
> a real issue and actually brings benefits; do you have any real examples on 
> where that happens, or why GPU targets haven't yet tried to change this? 
> Maybe other custom front-ends based on clang do? Finding the historical 
> reason (if any) should be a good start point.

I did "git blame" and I read the commit's message. You can also see it with 
"c58dcdc8facb646d88675bb6fbcb5c787166c4be". It is same with clang code's 
comment. I also wonder how the vec3 --> vec4 load/store has not caused 
problems. 

[PATCH] D30810: Preserve vec3 type.

2017-03-14 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#701132, @ahatanak wrote:

> Actually, it's not a mis-compile. The record layout shows that there is a 
> padding before field f2 and f2 starts at byte 16. So using "store <4 x 
> float>" doesn't overwrite the field.


It depends on float3's alignment. I guess float3's alignment is 16 on your 
target, so there are padding bytes before f2 to align it to 16. If you add 
__attribute__((packed)) to the struct type, you could see the overwrite.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-14 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#700476, @Anastasia wrote:

> In https://reviews.llvm.org/D30810#699827, @jaykang10 wrote:
>
> > In https://reviews.llvm.org/D30810#699695, @bruno wrote:
> >
> > > Hi JinGu,
> > >
> > > I just read the discussion on cfe-dev, some comments:
> > >
> > > > My colleague and I are implementing a transformation pass between LLVM 
> > > > IR and another IR and we want to keep the 3-component vector types in 
> > > > our target IR
> > >
> > > Why can't you go back through the shuffle to find out that it comes from 
> > > a vec3 -> vec4 transformation? Adding a flag here just for that seems 
> > > very odd.
> >
> >
> > Hi Bruno,
> >
> > Thanks for your comment. We have a pass to undo the vec4 to vec3. I 
> > wondered why clang generates the vec4 for vec3 load/store. As you can see 
> > the comment on clang's code, they are generated for better performance. I 
> > had a questions at this point. clang should consider vector load/store 
> > aligned by 4 regardless of target??? llvm's codegen could handle vec3 
> > according to targets' vector load/store with their alignment. I agree the 
> > flag looks odd but I was concerned some llvm's targets depend on the vec4 
> > so I suggested to add the flag. If I missed something, please let me know.
>
>
> I think doing this transformation in a frontend was not the best choice 
> because we are losing the source information too early. But some targets can 
> offer a better support for vec3 natively than converting to vec4. Regarding 
> the option, I am wondering whether adding this as a part of TargetInfo would 
> be a better solution. Although, I don't think any currently available 
> upstream targets support this.

I have a fundamental question about vec3 load/store again. I have checked how 
LLVM's codegen handles vec3 and vec4 loads/stores, assuming the target's 
legalization action for unsupported vector types is "widen". If they have the 
same alignment, LLVM's codegen generates one vector load for both of them. But 
the DAGTypeLegalizer handles stores differently between vec3 and vec4, because 
a vec4 store writes beyond the vec3 object. For example,

source code

  typedef float float3 __attribute__((ext_vector_type(3)));
  
  void foo(float3 *a, float3 *b) {
*a = *b; 
  }

LLVM IR for vec4 (this is clang's original output)

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
%castToVec4 = bitcast <3 x float>* %b to <4 x float>*
%loadVec4 = load <4 x float>, <4 x float>* %castToVec4, align 16
%storetmp = bitcast <3 x float>* %a to <4 x float>*
store <4 x float> %loadVec4, <4 x float>* %storetmp, align 16
ret void
  }

LLVM IR for vec3

  define void @foo(<3 x float>* nocapture %a, <3 x float>* nocapture readonly %b) {
  entry:
%0 = load <3 x float>, <3 x float>* %b, align 16
store <3 x float> %0, <3 x float>* %a, align 16
ret void
  }

As you can see in the IR for vec4 above, "store <4 x float> %loadVec4, <4 x 
float>* %storetmp, align 16" writes memory beyond the float3. Is that behavior 
correct? I cannot find a spec or document that mentions it... The vec3 store is 
lowered to more instructions, such as extractelements and scalar stores, but 
its behavior is correct. If the vec3 --> 
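
For reference, here is a minimal hand-written sketch of such a scalarized vec3 
store in IR. This is illustrative only, not actual codegen output, and the 
function and value names are made up:

  define void @store_vec3(<3 x float> %v, <3 x float>* %a) {
  entry:
    ; Store each lane individually, so no byte beyond the 12 bytes of the
    ; <3 x float> object is written.
    %ptr = bitcast <3 x float>* %a to float*
    %e0 = extractelement <3 x float> %v, i32 0
    %e1 = extractelement <3 x float> %v, i32 1
    %e2 = extractelement <3 x float> %v, i32 2
    %p1 = getelementptr inbounds float, float* %ptr, i32 1
    %p2 = getelementptr inbounds float, float* %ptr, i32 2
    store float %e0, float* %ptr, align 16
    store float %e1, float* %p1, align 4
    store float %e2, float* %p2, align 8
    ret void
  }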

[PATCH] D30810: Preserve vec3 type.

2017-03-13 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#699695, @bruno wrote:

> Hi JinGu,
>
> I just read the discussion on cfe-dev; some comments:
>
> > My colleague and I are implementing a transformation pass between LLVM IR 
> > and another IR and we want to keep the 3-component vector types in our 
> > target IR
>
> Why can't you go back through the shuffle to find out that it comes from a 
> vec3 -> vec4 transformation? Adding a flag here just for that seems very odd.


Hi Bruno,

Thanks for your comment. We have a pass to undo the vec4-to-vec3 conversion. I 
wondered why clang generates the vec4 for vec3 load/store. As you can see from 
the comment in clang's code, they are generated for better performance. I had a 
question at this point: should clang assume vector load/store are aligned by 4 
regardless of target? llvm's codegen could handle vec3 according to each 
target's vector load/store support and alignment. I agree the flag looks odd, 
but I was concerned that some llvm targets depend on the vec4, so I suggested 
adding the flag. If I missed something, please let me know.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-13 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#699428, @Anastasia wrote:

> Would you be able to update ScalarExprEmitter::VisitAsTypeExpr implementation 
> accordingly to make things consistent?


Probably not... the load/store with vec3 generates a vec4 temporarily only for 
alignment, but __builtin_astype needs to generate a vec4 from a vec3, or vice 
versa, because the spec allows it in section 6.2.4.2. As we discussed in 
previous e-mails, we would need to add an intrinsic function for 
__builtin_astype to llvm and add default behavior to llvm's codegen. That is 
beyond clang, and I am not sure llvm's IR and codegen people would allow it... 
If you have another idea or I missed something, please let me know.
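
To illustrate, here is a rough sketch of a float3-to-float4 __builtin_astype in 
IR, as far as I understand the current lowering (the value names are made up). 
The extra lane is simply undef; no memory round-trip is involved:

  ; %src holds a <3 x float> value; lane 3 of the result is undef
  ; because the source has no data for it.
  %astype = shufflevector <3 x float> %src, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 undef>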


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-13 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 marked an inline comment as done.
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#699006, @ahatanak wrote:

> Could you elaborate on the motivation for this change?
>
> I was wondering why clang (CodeGen) needed the help of a command line option 
> to decide whether vec3 should be converted to vec4. Can it just preserve vec3 
> when the architecture is spir?


Ah, sorry... I forgot to link the cfe-dev discussion: 
http://lists.llvm.org/pipermail/cfe-dev/2017-March/052956.html 
I hope it is helpful for you.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-10 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 added a comment.

In https://reviews.llvm.org/D30810#697760, @Anastasia wrote:

> Could you please add your test here (probably goes to test/CodeGenOpenCL)?


Oops!
I am so sorry. I missed it. I have updated it.


https://reviews.llvm.org/D30810



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D30810: Preserve vec3 type.

2017-03-10 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 updated this revision to Diff 91355.
jaykang10 added a comment.

Changed the help text for the option and added a test file.


https://reviews.llvm.org/D30810

Files:
  include/clang/Driver/CC1Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenOpenCL/preserve_vec3.cl

Index: test/CodeGenOpenCL/preserve_vec3.cl
===
--- /dev/null
+++ test/CodeGenOpenCL/preserve_vec3.cl
@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 %s -emit-llvm -o - -triple spir-unknown-unknown -preserve-vec3-type  | FileCheck %s
+
+typedef float float3 __attribute__((ext_vector_type(3)));
+typedef float float4 __attribute__((ext_vector_type(4)));
+
+void kernel foo(global float3 *a, global float3 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a
+  // CHECK: store <3 x float> %[[LOAD_A]], <3 x float> addrspace(1)* %b
+  *b = *a;
+}
Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -713,6 +713,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_fpreserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1368,26 +1368,28 @@
QualType TBAABaseType,
uint64_t TBAAOffset,
bool isNontemporal) {
-  // For better performance, handle vector loads differently.
-  if (Ty->isVectorType()) {
-const llvm::Type *EltTy = Addr.getElementType();
-
-const auto *VTy = cast<llvm::VectorType>(EltTy);
-
-// Handle vectors of size 3 like size 4 for better performance.
-if (VTy->getNumElements() == 3) {
-
-  // Bitcast to vec4 type.
-  llvm::VectorType *vec4Ty = llvm::VectorType::get(VTy->getElementType(),
- 4);
-  Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
-  // Now load value.
-  llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
-
-  // Shuffle vector to get vec3.
-  V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
-  {0, 1, 2}, "extractVec");
-  return EmitFromMemory(V, Ty);
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+// For better performance, handle vector loads differently.
+if (Ty->isVectorType()) {
+  const llvm::Type *EltTy = Addr.getElementType();
+
+  const auto *VTy = cast<llvm::VectorType>(EltTy);
+
+  // Handle vectors of size 3 like size 4 for better performance.
+  if (VTy->getNumElements() == 3) {
+
+// Bitcast to vec4 type.
+llvm::VectorType *vec4Ty =
+llvm::VectorType::get(VTy->getElementType(), 4);
+Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
+// Now load value.
+llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
+
+// Shuffle vector to get vec3.
+V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
+{0, 1, 2}, "extractVec");
+return EmitFromMemory(V, Ty);
+  }
 }
   }
 
@@ -1456,23 +1458,24 @@
 bool isNontemporal) {
 
   // Handle vectors differently to get better performance.
-  if (Ty->isVectorType()) {
-llvm::Type *SrcTy = Value->getType();
-auto *VecTy = cast<llvm::VectorType>(SrcTy);
-// Handle vec3 special.
-if (VecTy->getNumElements() == 3) {
-  // Our source is a vec3, do a shuffle vector to make it a vec4.
-  llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
-Builder.getInt32(2),
-llvm::UndefValue::get(Builder.getInt32Ty())};
-  llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
-  Value = Builder.CreateShuffleVector(Value,
-  llvm::UndefValue::get(VecTy),
-  MaskV, "extractVec");
-  SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
-}
-if (Addr.getElementType() != SrcTy) {
-  Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+if (Ty->isVectorType()) {
+  llvm::Type *SrcTy = Value->getType();
+  auto *VecTy = cast<llvm::VectorType>(SrcTy);
+  // Handle vec3 special.
+  if (VecTy->getNumElements() == 3) {
+// Our source is a vec3, do a shuffle vector to make it a vec4.
+llvm::Constant *Mask[] = 

[PATCH] D30810: Preserve vec3 type.

2017-03-10 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 updated this revision to Diff 91307.
jaykang10 added a comment.

Fixed typo.


https://reviews.llvm.org/D30810

Files:
  include/clang/Driver/CC1Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/Frontend/CompilerInvocation.cpp

Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -713,6 +713,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_fpreserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1368,26 +1368,28 @@
QualType TBAABaseType,
uint64_t TBAAOffset,
bool isNontemporal) {
-  // For better performance, handle vector loads differently.
-  if (Ty->isVectorType()) {
-const llvm::Type *EltTy = Addr.getElementType();
-
-const auto *VTy = cast<llvm::VectorType>(EltTy);
-
-// Handle vectors of size 3 like size 4 for better performance.
-if (VTy->getNumElements() == 3) {
-
-  // Bitcast to vec4 type.
-  llvm::VectorType *vec4Ty = llvm::VectorType::get(VTy->getElementType(),
- 4);
-  Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
-  // Now load value.
-  llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
-
-  // Shuffle vector to get vec3.
-  V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
-  {0, 1, 2}, "extractVec");
-  return EmitFromMemory(V, Ty);
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+// For better performance, handle vector loads differently.
+if (Ty->isVectorType()) {
+  const llvm::Type *EltTy = Addr.getElementType();
+
+  const auto *VTy = cast<llvm::VectorType>(EltTy);
+
+  // Handle vectors of size 3 like size 4 for better performance.
+  if (VTy->getNumElements() == 3) {
+
+// Bitcast to vec4 type.
+llvm::VectorType *vec4Ty =
+llvm::VectorType::get(VTy->getElementType(), 4);
+Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
+// Now load value.
+llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
+
+// Shuffle vector to get vec3.
+V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
+{0, 1, 2}, "extractVec");
+return EmitFromMemory(V, Ty);
+  }
 }
   }
 
@@ -1456,23 +1458,24 @@
 bool isNontemporal) {
 
   // Handle vectors differently to get better performance.
-  if (Ty->isVectorType()) {
-llvm::Type *SrcTy = Value->getType();
-auto *VecTy = cast<llvm::VectorType>(SrcTy);
-// Handle vec3 special.
-if (VecTy->getNumElements() == 3) {
-  // Our source is a vec3, do a shuffle vector to make it a vec4.
-  llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
-Builder.getInt32(2),
-llvm::UndefValue::get(Builder.getInt32Ty())};
-  llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
-  Value = Builder.CreateShuffleVector(Value,
-  llvm::UndefValue::get(VecTy),
-  MaskV, "extractVec");
-  SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
-}
-if (Addr.getElementType() != SrcTy) {
-  Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+if (Ty->isVectorType()) {
+  llvm::Type *SrcTy = Value->getType();
+  auto *VecTy = cast<llvm::VectorType>(SrcTy);
+  // Handle vec3 special.
+  if (VecTy->getNumElements() == 3) {
+// Our source is a vec3, do a shuffle vector to make it a vec4.
+llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
+  Builder.getInt32(2),
+  llvm::UndefValue::get(Builder.getInt32Ty())};
+llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
+Value = Builder.CreateShuffleVector(Value, llvm::UndefValue::get(VecTy),
+MaskV, "extractVec");
+SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
+  }
+  if (Addr.getElementType() != SrcTy) {
+Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  }
 }
   }
 
Index: include/clang/Frontend/CodeGenOptions.def

[PATCH] D30810: Preserve vec3 type.

2017-03-10 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 updated this revision to Diff 91306.
jaykang10 added a comment.

Added the -f prefix to the option name.


https://reviews.llvm.org/D30810

Files:
  include/clang/Driver/CC1Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/Frontend/CompilerInvocation.cpp

Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -713,6 +713,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_fpreserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1368,26 +1368,28 @@
QualType TBAABaseType,
uint64_t TBAAOffset,
bool isNontemporal) {
-  // For better performance, handle vector loads differently.
-  if (Ty->isVectorType()) {
-const llvm::Type *EltTy = Addr.getElementType();
-
-const auto *VTy = cast<llvm::VectorType>(EltTy);
-
-// Handle vectors of size 3 like size 4 for better performance.
-if (VTy->getNumElements() == 3) {
-
-  // Bitcast to vec4 type.
-  llvm::VectorType *vec4Ty = llvm::VectorType::get(VTy->getElementType(),
- 4);
-  Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
-  // Now load value.
-  llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
-
-  // Shuffle vector to get vec3.
-  V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
-  {0, 1, 2}, "extractVec");
-  return EmitFromMemory(V, Ty);
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+// For better performance, handle vector loads differently.
+if (Ty->isVectorType()) {
+  const llvm::Type *EltTy = Addr.getElementType();
+
+  const auto *VTy = cast<llvm::VectorType>(EltTy);
+
+  // Handle vectors of size 3 like size 4 for better performance.
+  if (VTy->getNumElements() == 3) {
+
+// Bitcast to vec4 type.
+llvm::VectorType *vec4Ty =
+llvm::VectorType::get(VTy->getElementType(), 4);
+Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
+// Now load value.
+llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
+
+// Shuffle vector to get vec3.
+V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
+{0, 1, 2}, "extractVec");
+return EmitFromMemory(V, Ty);
+  }
 }
   }
 
@@ -1456,23 +1458,24 @@
 bool isNontemporal) {
 
   // Handle vectors differently to get better performance.
-  if (Ty->isVectorType()) {
-llvm::Type *SrcTy = Value->getType();
-auto *VecTy = cast<llvm::VectorType>(SrcTy);
-// Handle vec3 special.
-if (VecTy->getNumElements() == 3) {
-  // Our source is a vec3, do a shuffle vector to make it a vec4.
-  llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
-Builder.getInt32(2),
-llvm::UndefValue::get(Builder.getInt32Ty())};
-  llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
-  Value = Builder.CreateShuffleVector(Value,
-  llvm::UndefValue::get(VecTy),
-  MaskV, "extractVec");
-  SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
-}
-if (Addr.getElementType() != SrcTy) {
-  Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+if (Ty->isVectorType()) {
+  llvm::Type *SrcTy = Value->getType();
+  auto *VecTy = cast<llvm::VectorType>(SrcTy);
+  // Handle vec3 special.
+  if (VecTy->getNumElements() == 3) {
+// Our source is a vec3, do a shuffle vector to make it a vec4.
+llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
+  Builder.getInt32(2),
+  llvm::UndefValue::get(Builder.getInt32Ty())};
+llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
+Value = Builder.CreateShuffleVector(Value, llvm::UndefValue::get(VecTy),
+MaskV, "extractVec");
+SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
+  }
+  if (Addr.getElementType() != SrcTy) {
+Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  }
 }
   }
 
Index: include/clang/Frontend/CodeGenOptions.def

[PATCH] D30810: Preserve vec3 type.

2017-03-09 Thread JinGu Kang via Phabricator via cfe-commits
jaykang10 created this revision.

Preserve vec3 type with a CodeGen option.


https://reviews.llvm.org/D30810

Files:
  include/clang/Driver/CC1Options.td
  include/clang/Frontend/CodeGenOptions.def
  lib/CodeGen/CGExpr.cpp
  lib/Frontend/CompilerInvocation.cpp
  test/CodeGenOpenCL/preserve_vec3.cl

Index: test/CodeGenOpenCL/preserve_vec3.cl
===
--- /dev/null
+++ test/CodeGenOpenCL/preserve_vec3.cl
@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 %s -emit-llvm -o - -triple spir-unknown-unknown -preserve-vec3-type  | FileCheck %s
+
+typedef float float3 __attribute__((ext_vector_type(3)));
+typedef float float4 __attribute__((ext_vector_type(4)));
+
+void kernel foo(global float3 *a, global float3 *b) {
+  // CHECK: %[[LOAD_A:.*]] = load <3 x float>, <3 x float> addrspace(1)* %a
+  // CHECK: store <3 x float> %[[LOAD_A]], <3 x float> addrspace(1)* %b
+  *b = *a;
+}
Index: lib/Frontend/CompilerInvocation.cpp
===
--- lib/Frontend/CompilerInvocation.cpp
+++ lib/Frontend/CompilerInvocation.cpp
@@ -713,6 +713,7 @@
 }
   }
 
+  Opts.PreserveVec3Type = Args.hasArg(OPT_preserve_vec3_type);
   Opts.InstrumentFunctions = Args.hasArg(OPT_finstrument_functions);
   Opts.XRayInstrumentFunctions = Args.hasArg(OPT_fxray_instrument);
   Opts.XRayInstructionThreshold =
Index: lib/CodeGen/CGExpr.cpp
===
--- lib/CodeGen/CGExpr.cpp
+++ lib/CodeGen/CGExpr.cpp
@@ -1368,26 +1368,28 @@
QualType TBAABaseType,
uint64_t TBAAOffset,
bool isNontemporal) {
-  // For better performance, handle vector loads differently.
-  if (Ty->isVectorType()) {
-const llvm::Type *EltTy = Addr.getElementType();
-
-const auto *VTy = cast<llvm::VectorType>(EltTy);
-
-// Handle vectors of size 3 like size 4 for better performance.
-if (VTy->getNumElements() == 3) {
-
-  // Bitcast to vec4 type.
-  llvm::VectorType *vec4Ty = llvm::VectorType::get(VTy->getElementType(),
- 4);
-  Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
-  // Now load value.
-  llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
-
-  // Shuffle vector to get vec3.
-  V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
-  {0, 1, 2}, "extractVec");
-  return EmitFromMemory(V, Ty);
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+// For better performance, handle vector loads differently.
+if (Ty->isVectorType()) {
+  const llvm::Type *EltTy = Addr.getElementType();
+
+  const auto *VTy = cast<llvm::VectorType>(EltTy);
+
+  // Handle vectors of size 3 like size 4 for better performance.
+  if (VTy->getNumElements() == 3) {
+
+// Bitcast to vec4 type.
+llvm::VectorType *vec4Ty =
+llvm::VectorType::get(VTy->getElementType(), 4);
+Address Cast = Builder.CreateElementBitCast(Addr, vec4Ty, "castToVec4");
+// Now load value.
+llvm::Value *V = Builder.CreateLoad(Cast, Volatile, "loadVec4");
+
+// Shuffle vector to get vec3.
+V = Builder.CreateShuffleVector(V, llvm::UndefValue::get(vec4Ty),
+{0, 1, 2}, "extractVec");
+return EmitFromMemory(V, Ty);
+  }
 }
   }
 
@@ -1456,23 +1458,24 @@
 bool isNontemporal) {
 
   // Handle vectors differently to get better performance.
-  if (Ty->isVectorType()) {
-llvm::Type *SrcTy = Value->getType();
-auto *VecTy = cast<llvm::VectorType>(SrcTy);
-// Handle vec3 special.
-if (VecTy->getNumElements() == 3) {
-  // Our source is a vec3, do a shuffle vector to make it a vec4.
-  llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
-Builder.getInt32(2),
-llvm::UndefValue::get(Builder.getInt32Ty())};
-  llvm::Value *MaskV = llvm::ConstantVector::get(Mask);
-  Value = Builder.CreateShuffleVector(Value,
-  llvm::UndefValue::get(VecTy),
-  MaskV, "extractVec");
-  SrcTy = llvm::VectorType::get(VecTy->getElementType(), 4);
-}
-if (Addr.getElementType() != SrcTy) {
-  Addr = Builder.CreateElementBitCast(Addr, SrcTy, "storetmp");
+  if (!CGM.getCodeGenOpts().PreserveVec3Type) {
+if (Ty->isVectorType()) {
+  llvm::Type *SrcTy = Value->getType();
+  auto *VecTy = cast<llvm::VectorType>(SrcTy);
+  // Handle vec3 special.
+  if (VecTy->getNumElements() == 3) {
+// Our source is a vec3, do a shuffle vector to make it a vec4.
+llvm::Constant *Mask[] = {Builder.getInt32(0), Builder.getInt32(1),
+