[llvm-branch-commits] [llvm] release/18.x: [AArch64][SelectionDAG] Mask for SUBS with multiple users cannot be elided (#90911) (PR #91151)

2024-05-14 Thread David Green via llvm-branch-commits

davemgreen wrote:

LGTM, I believe this should be safe to merge, if there are people asking for it.

https://github.com/llvm/llvm-project/pull/91151


[llvm-branch-commits] [llvm] release/18.x: [AArch64] Remove invalid uabdl patterns. (#89272) (PR #89380)

2024-04-23 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

I think this should be OK for the branch, if it is wanted. It should be a safe 
commit to backport, considering it just removes some invalid patterns. LGTM.

https://github.com/llvm/llvm-project/pull/89380


[llvm-branch-commits] [llvm] [AArch64][GlobalISel] Avoid splitting loads of large vector types into individual element loads (PR #85042)

2024-03-14 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

Thanks. LGTM

https://github.com/llvm/llvm-project/pull/85042


[llvm-branch-commits] [llvm] [AArch64][GlobalISel] Avoid splitting loads of large vector types into individual element loads (PR #85042)

2024-03-13 Thread David Green via llvm-branch-commits

https://github.com/davemgreen commented:

It looks like this comes from lowerIfMemSizeNotByteSizePow2. Custom is often best avoided unless there is no other way, or the change is quite target-dependent.

Can we try something like this instead?
```
  .clampMaxNumElements(0, s8, 16)
  .clampMaxNumElements(0, s16, 8)
  .clampMaxNumElements(0, s32, 4)
  .clampMaxNumElements(0, s64, 2)
  .clampMaxNumElements(0, p0, 2)
  .lowerIfMemSizeNotByteSizePow2()
  ...
```

https://github.com/llvm/llvm-project/pull/85042


[llvm-branch-commits] [llvm] release/18.x: [SelectionDAG] Change computeAliasing signature from optional to LocationSize. (#83017) (PR #83848)

2024-03-04 Thread David Green via llvm-branch-commits

davemgreen wrote:

See https://github.com/llvm/llvm-project/pull/83017, which describes it fixing the bug in a recent RISCV issue. I think the "Requested by" comes from the git committer.

@lukel97 I'm not sure if you have already or not, but it might be good to include the recent test you added too.

https://github.com/llvm/llvm-project/pull/83848


[llvm-branch-commits] [clang] [llvm] release/18.x: [AArch64] Backport Ampere1B support (#81297 , #81341, and #81744) (PR #81857)

2024-02-22 Thread David Green via llvm-branch-commits

davemgreen wrote:

> Is this fixing a regression introduced in Clang 18? I'm wondering why the 
> backport is needed in the first place, as this seems to be making potentially 
> significant changes during the RC ("Make +pauth enabled in Armv8.3-a by 
> default").

It is adding new CPU support to clang 18, specifically the ampere-1b. I'm not sure what is considered acceptable at this stage; it is up to the release maintainers whether they want to accept it. On a technical level I believe it should be OK, but if only regressions are being fixed at this stage then it might be better to wait for clang-19. I'm not sure how strongly @ptomsich wanted to get this into clang-18.

https://github.com/llvm/llvm-project/pull/81857


[llvm-branch-commits] [clang] [llvm] release/18.x: [AArch64] Backport Ampere1B support (#81297 , #81341, and #81744) (PR #81857)

2024-02-20 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

I believe considering what this changes it should be OK. LGTM

https://github.com/llvm/llvm-project/pull/81857


[llvm-branch-commits] [clang] [llvm] release/18.x: [AArch64] Backport Ampere1B support (#81297 , #81341, and #81744) (PR #81857)

2024-02-19 Thread David Green via llvm-branch-commits

davemgreen wrote:

This is a fairly big patch to backport. The ampere1b changes should be safe enough, considering they just add support for an extra CPU. There is also the change from #78027, added to change how PAUTH is enabled.

@atrosinenko @DavidSpickett do you think that is OK to backport to LLVM 18?

https://github.com/llvm/llvm-project/pull/81857


[llvm-branch-commits] [AArch64][GlobalISel] Improve codegen for G_VECREDUCE_{SMIN, SMAX, UMIN, UMAX} for odd-sized vectors (PR #81831)

2024-02-15 Thread David Green via llvm-branch-commits


@@ -1070,6 +1070,13 @@ AArch64LegalizerInfo::AArch64LegalizerInfo(const AArch64Subtarget &ST)
  {s16, v8s16},
  {s32, v2s32},
  {s32, v4s32}})
+  .moreElementsIf(

davemgreen wrote:

I think this can happen for more than just odd numbers, if we have support in 
the legalizer. I think I would make it moreElementsToNextPow2 unless there is a 
big reason not to.
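
For readers less familiar with the GlobalISel API, a minimal sketch of what such a rule set could look like (illustrative only: these are real LegalizeRuleSet methods, but the types and ordering here are assumptions, not the committed code):
```
  getActionDefinitionsBuilder({G_VECREDUCE_SMIN, G_VECREDUCE_SMAX,
                               G_VECREDUCE_UMIN, G_VECREDUCE_UMAX})
      .legalFor({{s8, v16s8}, {s16, v8s16}, {s32, v4s32}})
      // Widen any non-power-of-2 element count (e.g. v3s32 -> v4s32)
      // instead of special-casing odd vectors; type index 1 is the
      // vector source operand.
      .moreElementsToNextPow2(1)
      .clampMaxNumElements(1, s8, 16)
      .clampMaxNumElements(1, s16, 8)
      .clampMaxNumElements(1, s32, 4)
      .lower();
```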

https://github.com/llvm/llvm-project/pull/81831


[llvm-branch-commits] [llvm] release/18.x: [AArch64] Only apply bool vector bitcast opt if result is scalar (#81256) (PR #81454)

2024-02-12 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

Looks good to me to get onto the branch.

https://github.com/llvm/llvm-project/pull/81454


[llvm-branch-commits] [libcxx] [libc] [flang] [llvm] [clang] [compiler-rt] [SelectOpt] Print instruction instead of pointer (PR #80125)

2024-01-31 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

Thanks. LGTM

https://github.com/llvm/llvm-project/pull/80125


[llvm-branch-commits] [llvm] [clang] PR for llvm/llvm-project#79614 (PR #79870)

2024-01-29 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

Sounds simple enough to me. LGTM

https://github.com/llvm/llvm-project/pull/79870


[llvm-branch-commits] [llvm] PR for llvm/llvm-project#79800 (PR #79813)

2024-01-29 Thread David Green via llvm-branch-commits

https://github.com/davemgreen approved this pull request.

The perf regression was fairly significant, so it would be good to get this 
into the branch. Thanks.

https://github.com/llvm/llvm-project/pull/79813


[llvm-branch-commits] [clang] [llvm] [compiler-rt] [TySan] A Type Sanitizer (Runtime Library) (PR #76261)

2024-01-20 Thread David Green via llvm-branch-commits


@@ -720,7 +726,7 @@ if(COMPILER_RT_SUPPORTED_ARCH)
 endif()
 message(STATUS "Compiler-RT supported architectures: 
${COMPILER_RT_SUPPORTED_ARCH}")
 
-set(ALL_SANITIZERS asan;dfsan;msan;hwasan;tsan;safestack;cfi;scudo_standalone;ubsan_minimal;gwp_asan;asan_abi)
+set(ALL_SANITIZERS asan;dfsan;msan;hwasan;tsan;tysan,safestack;cfi;scudo_standalone;ubsan_minimal;gwp_asan;asan_abi)

davemgreen wrote:

^ `tysan;`

https://github.com/llvm/llvm-project/pull/76261


[llvm-branch-commits] [llvm] e841bd5 - [ARM] Extra MVE unaligned VLDn tests. NFC

2021-01-24 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-24T21:39:00Z
New Revision: e841bd5f335864b8c4d81cbf4df08460ef39f2ae

URL: 
https://github.com/llvm/llvm-project/commit/e841bd5f335864b8c4d81cbf4df08460ef39f2ae
DIFF: 
https://github.com/llvm/llvm-project/commit/e841bd5f335864b8c4d81cbf4df08460ef39f2ae.diff

LOG: [ARM] Extra MVE unaligned VLDn tests. NFC

Added: 


Modified: 
llvm/test/CodeGen/Thumb2/mve-vld2.ll
llvm/test/CodeGen/Thumb2/mve-vld4.ll
llvm/test/CodeGen/Thumb2/mve-vst2.ll
llvm/test/CodeGen/Thumb2/mve-vst4.ll

Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-vld2.ll 
b/llvm/test/CodeGen/Thumb2/mve-vld2.ll
index f33a7237151c..b5309aab1f60 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vld2.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vld2.ll
@@ -98,6 +98,23 @@ entry:
   ret void
 }
 
+define void @vld2_v4i32_align1(<8 x i32> *%src, <4 x i32> *%dst) {
+; CHECK-LABEL: vld2_v4i32_align1:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vld20.32 {q0, q1}, [r0]
+; CHECK-NEXT:vld21.32 {q0, q1}, [r0]
+; CHECK-NEXT:vadd.i32 q0, q0, q1
+; CHECK-NEXT:vstrw.32 q0, [r1]
+; CHECK-NEXT:bx lr
+entry:
+  %l1 = load <8 x i32>, <8 x i32>* %src, align 1
+  %s1 = shufflevector <8 x i32> %l1, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  %s2 = shufflevector <8 x i32> %l1, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  %a = add <4 x i32> %s1, %s2
+  store <4 x i32> %a, <4 x i32> *%dst
+  ret void
+}
+
 ; i16
 
 define void @vld2_v2i16(<4 x i16> *%src, <2 x i16> *%dst) {
@@ -115,7 +132,7 @@ define void @vld2_v2i16(<4 x i16> *%src, <2 x i16> *%dst) {
 ; CHECK-NEXT:strh r0, [r1]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <4 x i16>, <4 x i16>* %src, align 4
+  %l1 = load <4 x i16>, <4 x i16>* %src, align 2
  %s1 = shufflevector <4 x i16> %l1, <4 x i16> undef, <2 x i32> <i32 0, i32 2>
  %s2 = shufflevector <4 x i16> %l1, <4 x i16> undef, <2 x i32> <i32 1, i32 3>
   %a = add <2 x i16> %s1, %s2
@@ -126,13 +143,13 @@ entry:
 define void @vld2_v4i16(<8 x i16> *%src, <4 x i16> *%dst) {
 ; CHECK-LABEL: vld2_v4i16:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:vldrw.u32 q0, [r0]
+; CHECK-NEXT:vldrh.u16 q0, [r0]
 ; CHECK-NEXT:vrev32.16 q1, q0
 ; CHECK-NEXT:vadd.i32 q0, q0, q1
 ; CHECK-NEXT:vstrh.32 q0, [r1]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <8 x i16>, <8 x i16>* %src, align 4
+  %l1 = load <8 x i16>, <8 x i16>* %src, align 2
  %s1 = shufflevector <8 x i16> %l1, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %s2 = shufflevector <8 x i16> %l1, <8 x i16> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %a = add <4 x i16> %s1, %s2
@@ -149,7 +166,7 @@ define void @vld2_v8i16(<16 x i16> *%src, <8 x i16> *%dst) {
 ; CHECK-NEXT:vstrw.32 q0, [r1]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <16 x i16>, <16 x i16>* %src, align 4
+  %l1 = load <16 x i16>, <16 x i16>* %src, align 2
  %s1 = shufflevector <16 x i16> %l1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %s2 = shufflevector <16 x i16> %l1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
   %a = add <8 x i16> %s1, %s2
@@ -170,7 +187,7 @@ define void @vld2_v16i16(<32 x i16> *%src, <16 x i16> 
*%dst) {
 ; CHECK-NEXT:vstrw.32 q1, [r1, #16]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <32 x i16>, <32 x i16>* %src, align 4
+  %l1 = load <32 x i16>, <32 x i16>* %src, align 2
  %s1 = shufflevector <32 x i16> %l1, <32 x i16> undef, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
  %s2 = shufflevector <32 x i16> %l1, <32 x i16> undef, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
   %a = add <16 x i16> %s1, %s2
@@ -178,6 +195,23 @@ entry:
   ret void
 }
 
+define void @vld2_v8i16_align1(<16 x i16> *%src, <8 x i16> *%dst) {
+; CHECK-LABEL: vld2_v8i16_align1:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vld20.16 {q0, q1}, [r0]
+; CHECK-NEXT:vld21.16 {q0, q1}, [r0]
+; CHECK-NEXT:vadd.i16 q0, q0, q1
+; CHECK-NEXT:vstrw.32 q0, [r1]
+; CHECK-NEXT:bx lr
+entry:
+  %l1 = load <16 x i16>, <16 x i16>* %src, align 1
+  %s1 = shufflevector <16 x i16> %l1, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
+  %s2 = shufflevector <16 x i16> %l1, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
+  %a = add <8 x i16> %s1, %s2
+  store <8 x i16> %a, <8 x i16> *%dst
+  ret void
+}
+
 ; i8
 
 define void @vld2_v2i8(<4 x i8> *%src, <2 x i8> *%dst) {
@@ -195,7 +229,7 @@ define void @vld2_v2i8(<4 x i8> *%src, <2 x i8> *%dst) {
 ; CHECK-NEXT:strb r0, [r1]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <4 x i8>, <4 x i8>* %src, align 4
+  %l1 = load <4 x i8>, <4 x i8>* %src, align 1
  %s1 = shufflevector <4 x i8> %l1, <4 x i8> undef, <2 x i32> <i32 0, i32 2>
  %s2 = shufflevector <4 x i8> %l1, <4 x i8> undef, <2 x i32> <i32 1, i32 3>
   %a = add <2 x i8> %s1, %s2
@@ -212,7 +246,7 @@ define void @vld2_v4i8(<8 x i8> *%src, <4 x i8> *%dst) {
 ; CHECK-NEXT:vstrb.32 q0, [r1]
 ; CHECK-NEXT:bx lr
 entry:
-  %l1 = load <8 x i8>, <8 x i8>* %src, align 4
+  %l1 = load <8 x i8>, <8 x i8>* %src, align 1
  %s1 = shufflevector <8 x i8> %l1, <8 x i8> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %s2 = shufflevector <8 x i8> %l1, <8 x i8> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   %a = add <4 x i8> %s1, %s2
@@ -223,13 +257,13 @@ entry:
 define void 

[llvm-branch-commits] [llvm] 4cc94b7 - [CostModel] Tests for showing the cost of intrinsics from the vectorizer. NFC

2021-01-24 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-24T14:47:15Z
New Revision: 4cc94b731345aa494e0e364846ba9550f5dd5105

URL: 
https://github.com/llvm/llvm-project/commit/4cc94b731345aa494e0e364846ba9550f5dd5105
DIFF: 
https://github.com/llvm/llvm-project/commit/4cc94b731345aa494e0e364846ba9550f5dd5105.diff

LOG: [CostModel] Tests for showing the cost of intrinsics from the vectorizer. 
NFC

Added: 
llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
llvm/test/Transforms/LoopVectorize/X86/intrinsiccost.ll

Modified: 
llvm/test/Transforms/LoopVectorize/ARM/mve-saddsatcost.ll

Removed: 




diff  --git a/llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll 
b/llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
new file mode 100644
index ..b86a7da0daff
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
@@ -0,0 +1,211 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -loop-vectorize -instcombine -simplifycfg < %s -S -o - | FileCheck 
%s --check-prefix=CHECK
+; RUN: opt -loop-vectorize -debug-only=loop-vectorize -disable-output < %s 
2>&1 | FileCheck %s --check-prefix=CHECK-COST
+; REQUIRES: asserts
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64--linux-gnu"
+
+; CHECK-COST-LABEL: sadd
+; CHECK-COST: Found an estimated cost of 10 for VF 1 For instruction:   %1 = 
tail call i16 @llvm.sadd.sat.i16(i16 %0, i16 %offset)
+; CHECK-COST: Found an estimated cost of 26 for VF 2 For instruction:   %1 = 
tail call i16 @llvm.sadd.sat.i16(i16 %0, i16 %offset)
+; CHECK-COST: Found an estimated cost of 58 for VF 4 For instruction:   %1 = 
tail call i16 @llvm.sadd.sat.i16(i16 %0, i16 %offset)
+; CHECK-COST: Found an estimated cost of 122 for VF 8 For instruction:   %1 = 
tail call i16 @llvm.sadd.sat.i16(i16 %0, i16 %offset)
+
+define void @saddsat(i16* nocapture readonly %pSrc, i16 signext %offset, i16* 
nocapture noalias %pDst, i32 %blockSize) #0 {
+; CHECK-LABEL: @saddsat(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:[[CMP_NOT6:%.*]] = icmp eq i32 [[BLOCKSIZE:%.*]], 0
+; CHECK-NEXT:br i1 [[CMP_NOT6]], label [[WHILE_END:%.*]], label 
[[WHILE_BODY_PREHEADER:%.*]]
+; CHECK:   while.body.preheader:
+; CHECK-NEXT:[[TMP0:%.*]] = add i32 [[BLOCKSIZE]], -1
+; CHECK-NEXT:[[TMP1:%.*]] = zext i32 [[TMP0]] to i64
+; CHECK-NEXT:[[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
+; CHECK-NEXT:[[MIN_ITERS_CHECK:%.*]] = icmp eq i32 [[TMP0]], 0
+; CHECK-NEXT:br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label 
[[VECTOR_PH:%.*]]
+; CHECK:   vector.ph:
+; CHECK-NEXT:[[N_VEC:%.*]] = and i64 [[TMP2]], 8589934590
+; CHECK-NEXT:[[CAST_CRD:%.*]] = trunc i64 [[N_VEC]] to i32
+; CHECK-NEXT:[[IND_END:%.*]] = sub i32 [[BLOCKSIZE]], [[CAST_CRD]]
+; CHECK-NEXT:[[IND_END2:%.*]] = getelementptr i16, i16* [[PSRC:%.*]], i64 
[[N_VEC]]
+; CHECK-NEXT:[[IND_END4:%.*]] = getelementptr i16, i16* [[PDST:%.*]], i64 
[[N_VEC]]
+; CHECK-NEXT:[[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i16> 
poison, i16 [[OFFSET:%.*]], i32 0
+; CHECK-NEXT:[[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i16> 
[[BROADCAST_SPLATINSERT]], <2 x i16> poison, <2 x i32> zeroinitializer
+; CHECK-NEXT:br label [[VECTOR_BODY:%.*]]
+; CHECK:   vector.body:
+; CHECK-NEXT:[[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ 
[[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:[[NEXT_GEP:%.*]] = getelementptr i16, i16* [[PSRC]], i64 
[[INDEX]]
+; CHECK-NEXT:[[NEXT_GEP5:%.*]] = getelementptr i16, i16* [[PDST]], i64 
[[INDEX]]
+; CHECK-NEXT:[[TMP3:%.*]] = bitcast i16* [[NEXT_GEP]] to <2 x i16>*
+; CHECK-NEXT:[[WIDE_LOAD:%.*]] = load <2 x i16>, <2 x i16>* [[TMP3]], 
align 2
+; CHECK-NEXT:[[TMP4:%.*]] = call <2 x i16> @llvm.sadd.sat.v2i16(<2 x i16> 
[[WIDE_LOAD]], <2 x i16> [[BROADCAST_SPLAT]])
+; CHECK-NEXT:[[TMP5:%.*]] = bitcast i16* [[NEXT_GEP5]] to <2 x i16>*
+; CHECK-NEXT:store <2 x i16> [[TMP4]], <2 x i16>* [[TMP5]], align 2
+; CHECK-NEXT:[[INDEX_NEXT]] = add i64 [[INDEX]], 2
+; CHECK-NEXT:[[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:br i1 [[TMP6]], label [[MIDDLE_BLOCK:%.*]], label 
[[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
+; CHECK:   middle.block:
+; CHECK-NEXT:[[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
+; CHECK-NEXT:br i1 [[CMP_N]], label [[WHILE_END]], label [[SCALAR_PH]]
+; CHECK:   scalar.ph:
+; CHECK-NEXT:[[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], 
[[MIDDLE_BLOCK]] ], [ [[BLOCKSIZE]], [[WHILE_BODY_PREHEADER]] ]
+; CHECK-NEXT:[[BC_RESUME_VAL1:%.*]] = phi i16* [ [[IND_END2]], 
[[MIDDLE_BLOCK]] ], [ [[PSRC]], [[WHILE_BODY_PREHEADER]] ]
+; CHECK-NEXT:[[BC_RESUME_VAL3:%.*]] = phi i16* [ [[IND_END4]], 
[[MIDDLE_BLOCK]] ], [ [[PDST]], [[WHILE_BODY_PREHEADER]] ]
+; CHECK-NEXT:br label 

[llvm-branch-commits] [llvm] 06ab795 - [AArch64] Saturating add cost tests. NFC

2021-01-24 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-24T13:49:17Z
New Revision: 06ab7953e98222de1ace4520163b4fa53565ead4

URL: 
https://github.com/llvm/llvm-project/commit/06ab7953e98222de1ace4520163b4fa53565ead4
DIFF: 
https://github.com/llvm/llvm-project/commit/06ab7953e98222de1ace4520163b4fa53565ead4.diff

LOG: [AArch64] Saturating add cost tests. NFC

Added: 
llvm/test/Analysis/CostModel/AArch64/arith-ssat.ll
llvm/test/Analysis/CostModel/AArch64/arith-usat.ll

Modified: 


Removed: 




diff  --git a/llvm/test/Analysis/CostModel/AArch64/arith-ssat.ll 
b/llvm/test/Analysis/CostModel/AArch64/arith-ssat.ll
new file mode 100644
index ..f55022141d50
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/AArch64/arith-ssat.ll
@@ -0,0 +1,215 @@
+; NOTE: Assertions have been autogenerated by 
utils/update_analyze_test_checks.py
+; RUN: opt -cost-model -analyze -mtriple=aarch64-none-eabi < %s | FileCheck %s 
--check-prefix=RECIP
+; RUN: opt -cost-model -analyze -cost-kind=code-size 
-mtriple=aarch64-none-eabi < %s | FileCheck %s --check-prefix=SIZE
+
+declare i64 @llvm.sadd.sat.i64(i64, i64)
+declare <2 x i64>  @llvm.sadd.sat.v2i64(<2 x i64>, <2 x i64>)
+declare <4 x i64>  @llvm.sadd.sat.v4i64(<4 x i64>, <4 x i64>)
+declare <8 x i64>  @llvm.sadd.sat.v8i64(<8 x i64>, <8 x i64>)
+
+declare i32 @llvm.sadd.sat.i32(i32, i32)
+declare <2 x i32>  @llvm.sadd.sat.v2i32(<2 x i32>, <2 x i32>)
+declare <4 x i32>  @llvm.sadd.sat.v4i32(<4 x i32>, <4 x i32>)
+declare <8 x i32>  @llvm.sadd.sat.v8i32(<8 x i32>, <8 x i32>)
+declare <16 x i32> @llvm.sadd.sat.v16i32(<16 x i32>, <16 x i32>)
+
+declare i16 @llvm.sadd.sat.i16(i16, i16)
+declare <2 x i16>  @llvm.sadd.sat.v2i16(<2 x i16>, <2 x i16>)
+declare <4 x i16>  @llvm.sadd.sat.v4i16(<4 x i16>, <4 x i16>)
+declare <8 x i16>  @llvm.sadd.sat.v8i16(<8 x i16>, <8 x i16>)
+declare <16 x i16> @llvm.sadd.sat.v16i16(<16 x i16>, <16 x i16>)
+declare <32 x i16> @llvm.sadd.sat.v32i16(<32 x i16>, <32 x i16>)
+
+declare i8 @llvm.sadd.sat.i8(i8,  i8)
+declare <2 x i8>   @llvm.sadd.sat.v2i8(<2 x i8>, <2 x i8>)
+declare <4 x i8>   @llvm.sadd.sat.v4i8(<4 x i8>, <4 x i8>)
+declare <8 x i8>   @llvm.sadd.sat.v8i8(<8 x i8>, <8 x i8>)
+declare <16 x i8>  @llvm.sadd.sat.v16i8(<16 x i8>, <16 x i8>)
+declare <32 x i8>  @llvm.sadd.sat.v32i8(<32 x i8>, <32 x i8>)
+declare <64 x i8>  @llvm.sadd.sat.v64i8(<64 x i8>, <64 x i8>)
+
+define i32 @add(i32 %arg) {
+; RECIP-LABEL: 'add'
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I64 
= call i64 @llvm.sadd.sat.i64(i64 undef, i64 undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: 
%V2I64 = call <2 x i64> @llvm.sadd.sat.v2i64(<2 x i64> undef, <2 x i64> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 197 for instruction: 
%V4I64 = call <4 x i64> @llvm.sadd.sat.v4i64(<4 x i64> undef, <4 x i64> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 399 for instruction: 
%V8I64 = call <8 x i64> @llvm.sadd.sat.v8i64(<8 x i64> undef, <8 x i64> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I32 
= call i32 @llvm.sadd.sat.i32(i32 undef, i32 undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: 
%V2I32 = call <2 x i32> @llvm.sadd.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 58 for instruction: 
%V4I32 = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 85 for instruction: 
%V8I32 = call <8 x i32> @llvm.sadd.sat.v8i32(<8 x i32> undef, <8 x i32> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 175 for instruction: 
%V16I32 = call <16 x i32> @llvm.sadd.sat.v16i32(<16 x i32> undef, <16 x i32> 
undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I16 
= call i16 @llvm.sadd.sat.i16(i16 undef, i16 undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 26 for instruction: 
%V2I16 = call <2 x i16> @llvm.sadd.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 58 for instruction: 
%V4I16 = call <4 x i16> @llvm.sadd.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 122 for instruction: 
%V8I16 = call <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 165 for instruction: 
%V16I16 = call <16 x i16> @llvm.sadd.sat.v16i16(<16 x i16> undef, <16 x i16> 
undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 498 for instruction: 
%V32I16 = call <32 x i16> @llvm.sadd.sat.v32i16(<32 x i16> undef, <32 x i16> 
undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %I8 
= call i8 @llvm.sadd.sat.i8(i8 undef, i8 undef)
+; RECIP-NEXT:  Cost Model: Found an estimated cost of 

[llvm-branch-commits] [llvm] af03324 - [ARM] Disable sign extended SSAT pattern recognition.

2021-01-22 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-22T14:07:48Z
New Revision: af0332498405b3a4074cef09845bbacfd4fd594f

URL: 
https://github.com/llvm/llvm-project/commit/af0332498405b3a4074cef09845bbacfd4fd594f
DIFF: 
https://github.com/llvm/llvm-project/commit/af0332498405b3a4074cef09845bbacfd4fd594f.diff

LOG: [ARM] Disable sign extended SSAT pattern recognition.

I may have given bad advice: skipping sext_inreg when matching SSAT patterns is not valid on its own. It at least needs to sext_inreg the input again, and as far as I can tell it is still only valid based on demanded bits. For the moment, disable that part of the combine, hopefully to be reimplemented more correctly in the future.
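
To make the hazard concrete, a small standalone C++ model (constructed for this note, assuming two's complement; not part of the patch) showing that saturating the raw value and saturating the sign-extended value can disagree, which is why the sext_inreg cannot simply be dropped:
```
#include <cstdint>
#include <cstdio>

// Clamp to the 8-bit signed range [-128, 127], as SSAT #8 would.
static int32_t ssat8(int32_t V) {
  return V > 127 ? 127 : (V < -128 ? -128 : V);
}

int main() {
  int32_t X = 0x8000;                      // 32768 as an i32
  int32_t SExt = static_cast<int16_t>(X);  // sext_inreg to i16: -32768
  std::printf("ssat8(sext_inreg x) = %d\n", ssat8(SExt)); // -128
  std::printf("ssat8(x)            = %d\n", ssat8(X));    // 127: differs
  return 0;
}
```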

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/ARM/ssat.ll
llvm/test/CodeGen/ARM/usat.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 949d2ffc1714..f6f8597f3a69 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -5062,12 +5062,6 @@ static SDValue LowerSaturatingConditional(SDValue Op, SelectionDAG &DAG) {
   SDValue V1Tmp = V1;
   SDValue V2Tmp = V2;
 
-  if (V1.getOpcode() == ISD::SIGN_EXTEND_INREG &&
-  V2.getOpcode() == ISD::SIGN_EXTEND_INREG) {
-V1Tmp = V1.getOperand(0);
-V2Tmp = V2.getOperand(0);
-  }
-
   // Check that the registers and the constants match a max(min()) or 
min(max())
   // pattern
   if (V1Tmp != TrueVal1 || V2Tmp != TrueVal2 || K1 != FalseVal1 ||

diff  --git a/llvm/test/CodeGen/ARM/ssat.ll b/llvm/test/CodeGen/ARM/ssat.ll
index 9d9758b0515d..fb3c17710b75 100644
--- a/llvm/test/CodeGen/ARM/ssat.ll
+++ b/llvm/test/CodeGen/ARM/ssat.ll
@@ -68,7 +68,15 @@ define i16 @sat_base_16bit(i16 %x) #0 {
 ;
 ; V6T2-LABEL: sat_base_16bit:
 ; V6T2:   @ %bb.0: @ %entry
-; V6T2-NEXT:ssat r0, #12, r0
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:movw r2, #2047
+; V6T2-NEXT:cmp r1, r2
+; V6T2-NEXT:movlt r2, r0
+; V6T2-NEXT:movw r0, #63488
+; V6T2-NEXT:sxth r1, r2
+; V6T2-NEXT:movt r0, #65535
+; V6T2-NEXT:cmn r1, #2048
+; V6T2-NEXT:movgt r0, r2
 ; V6T2-NEXT:bx lr
 entry:
   %0 = icmp slt i16 %x, 2047
@@ -95,7 +103,12 @@ define i8 @sat_base_8bit(i8 %x) #0 {
 ;
 ; V6T2-LABEL: sat_base_8bit:
 ; V6T2:   @ %bb.0: @ %entry
-; V6T2-NEXT:ssat r0, #6, r0
+; V6T2-NEXT:sxtb r1, r0
+; V6T2-NEXT:cmp r1, #31
+; V6T2-NEXT:movge r0, #31
+; V6T2-NEXT:sxtb r1, r0
+; V6T2-NEXT:cmn r1, #32
+; V6T2-NEXT:mvnle r0, #31
 ; V6T2-NEXT:bx lr
 entry:
   %0 = icmp slt i8 %x, 31
@@ -547,7 +560,12 @@ define void @extended(i32 %xx, i16 signext %y, i8* 
nocapture %z) {
 ; V6T2-LABEL: extended:
 ; V6T2:   @ %bb.0: @ %entry
 ; V6T2-NEXT:add r0, r1, r0, lsr #16
-; V6T2-NEXT:ssat r0, #8, r0
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmp r1, #127
+; V6T2-NEXT:movge r0, #127
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmn r1, #128
+; V6T2-NEXT:mvnle r0, #127
 ; V6T2-NEXT:strb r0, [r2]
 ; V6T2-NEXT:bx lr
 entry:
@@ -582,7 +600,12 @@ define i32 @formulated_valid(i32 %a) {
 ;
 ; V6T2-LABEL: formulated_valid:
 ; V6T2:   @ %bb.0:
-; V6T2-NEXT:ssat r0, #8, r0
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmp r1, #127
+; V6T2-NEXT:movge r0, #127
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmn r1, #128
+; V6T2-NEXT:mvnle r0, #127
 ; V6T2-NEXT:uxth r0, r0
 ; V6T2-NEXT:bx lr
   %x1 = trunc i32 %a to i16
@@ -613,7 +636,12 @@ define i32 @formulated_invalid(i32 %a) {
 ;
 ; V6T2-LABEL: formulated_invalid:
 ; V6T2:   @ %bb.0:
-; V6T2-NEXT:ssat r0, #8, r0
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmp r1, #127
+; V6T2-NEXT:movge r0, #127
+; V6T2-NEXT:sxth r1, r0
+; V6T2-NEXT:cmn r1, #128
+; V6T2-NEXT:mvnle r0, #127
 ; V6T2-NEXT:bic r0, r0, #-16777216
 ; V6T2-NEXT:bx lr
   %x1 = trunc i32 %a to i16

diff  --git a/llvm/test/CodeGen/ARM/usat.ll b/llvm/test/CodeGen/ARM/usat.ll
index ab508fc0e032..dd0eca823a50 100644
--- a/llvm/test/CodeGen/ARM/usat.ll
+++ b/llvm/test/CodeGen/ARM/usat.ll
@@ -67,12 +67,27 @@ define i16 @unsigned_sat_base_16bit(i16 %x) #0 {
 ;
 ; V6-LABEL: unsigned_sat_base_16bit:
 ; V6:   @ %bb.0: @ %entry
-; V6-NEXT:usat r0, #11, r0
+; V6-NEXT:mov r1, #255
+; V6-NEXT:sxth r2, r0
+; V6-NEXT:orr r1, r1, #1792
+; V6-NEXT:cmp r2, r1
+; V6-NEXT:movlt r1, r0
+; V6-NEXT:sxth r0, r1
+; V6-NEXT:cmp r0, #0
+; V6-NEXT:movle r1, #0
+; V6-NEXT:mov r0, r1
 ; V6-NEXT:bx lr
 ;
 ; V6T2-LABEL: unsigned_sat_base_16bit:
 ; V6T2:   @ %bb.0: @ %entry
-; V6T2-NEXT:usat r0, #11, r0
+; V6T2-NEXT:sxth r2, r0
+; V6T2-NEXT:movw r1, #2047
+; V6T2-NEXT:cmp r2, r1
+; V6T2-NEXT:movlt r1, r0
+; V6T2-NEXT:sxth r0, r1
+; V6T2-NEXT:cmp r0, #0
+; V6T2-NEXT:movle r1, #0
+; V6T2-NEXT:mov r0, r1
 ; 

[llvm-branch-commits] [llvm] 9ae73cd - [ARM] Adjust isSaturatingConditional to return a new SDValue. NFC

2021-01-22 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-22T11:11:36Z
New Revision: 9ae73cdbc1e59fd3149e60efd2b96e68e8d1669b

URL: 
https://github.com/llvm/llvm-project/commit/9ae73cdbc1e59fd3149e60efd2b96e68e8d1669b
DIFF: 
https://github.com/llvm/llvm-project/commit/9ae73cdbc1e59fd3149e60efd2b96e68e8d1669b.diff

LOG: [ARM] Adjust isSaturatingConditional to return a new SDValue. NFC

This replaces the isSaturatingConditional function with
LowerSaturatingConditional that directly returns a new SSAT or
USAT SDValue, instead of returning true and the components of it.

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index aabfad045d9f..949d2ffc1714 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -5036,17 +5036,13 @@ static bool isLowerSaturate(const SDValue LHS, const 
SDValue RHS,
 // etc.
 //
 // LLVM canonicalizes these to either a min(max()) or a max(min())
-// pattern. This function tries to match one of these and will return true
-// if successful.
+// pattern. This function tries to match one of these and will return a SSAT
+// node if successful.
 //
-// USAT works similarily to SSAT but bounds on the interval [0, k] where k + 1 
is
-// a power of 2.
-//
-// It returns true if the conversion can be done, false otherwise.
-// Additionally, the variable is returned in parameter V, the constant in K and
-// usat is set to true if the conditional represents an unsigned saturation
-static bool isSaturatingConditional(const SDValue &Op, SDValue &V,
-                                    uint64_t &K, bool &Usat) {
+// USAT works similarily to SSAT but bounds on the interval [0, k] where k + 1
+// is a power of 2.
+static SDValue LowerSaturatingConditional(SDValue Op, SelectionDAG &DAG) {
+  EVT VT = Op.getValueType();
   SDValue V1 = Op.getOperand(0);
   SDValue K1 = Op.getOperand(1);
   SDValue TrueVal1 = Op.getOperand(2);
@@ -5055,7 +5051,7 @@ static bool isSaturatingConditional(const SDValue &Op, SDValue &V,
 
   const SDValue Op2 = isa<ConstantSDNode>(TrueVal1) ? FalseVal1 : TrueVal1;
   if (Op2.getOpcode() != ISD::SELECT_CC)
-return false;
+return SDValue();
 
   SDValue V2 = Op2.getOperand(0);
   SDValue K2 = Op2.getOperand(1);
@@ -5074,41 +5070,39 @@ static bool isSaturatingConditional(const SDValue &Op, SDValue &V,
 
   // Check that the registers and the constants match a max(min()) or 
min(max())
   // pattern
-  if (V1Tmp == TrueVal1 && V2Tmp == TrueVal2 && K1 == FalseVal1 &&
-  K2 == FalseVal2 &&
-  ((isGTorGE(CC1) && isLTorLE(CC2)) || (isLTorLE(CC1) && isGTorGE(CC2 {
-
-// Check that the constant in the lower-bound check is
-// the opposite of the constant in the upper-bound check
-// in 1's complement.
-if (!isa<ConstantSDNode>(K1) || !isa<ConstantSDNode>(K2))
-  return false;
+  if (V1Tmp != TrueVal1 || V2Tmp != TrueVal2 || K1 != FalseVal1 ||
+  K2 != FalseVal2 ||
+  !((isGTorGE(CC1) && isLTorLE(CC2)) || (isLTorLE(CC1) && isGTorGE(CC2
+return SDValue();
 
-int64_t Val1 = cast<ConstantSDNode>(K1)->getSExtValue();
-int64_t Val2 = cast<ConstantSDNode>(K2)->getSExtValue();
-int64_t PosVal = std::max(Val1, Val2);
-int64_t NegVal = std::min(Val1, Val2);
+  // Check that the constant in the lower-bound check is
+  // the opposite of the constant in the upper-bound check
+  // in 1's complement.
+  if (!isa<ConstantSDNode>(K1) || !isa<ConstantSDNode>(K2))
+return SDValue();
 
-if (!((Val1 > Val2 && isLTorLE(CC1)) || (Val1 < Val2 && isLTorLE(CC2))) ||
-!isPowerOf2_64(PosVal + 1)) 
-  return false;
+  int64_t Val1 = cast<ConstantSDNode>(K1)->getSExtValue();
+  int64_t Val2 = cast<ConstantSDNode>(K2)->getSExtValue();
+  int64_t PosVal = std::max(Val1, Val2);
+  int64_t NegVal = std::min(Val1, Val2);
 
-// Handle the difference between USAT (unsigned) and SSAT (signed)
-// saturation
-if (Val1 == ~Val2)
-  Usat = false;
-else if (NegVal == 0)
-  Usat = true;
-else
-  return false;
+  if (!((Val1 > Val2 && isLTorLE(CC1)) || (Val1 < Val2 && isLTorLE(CC2))) ||
+  !isPowerOf2_64(PosVal + 1))
+return SDValue();
 
-V = V2Tmp;
-// At this point, PosVal is guaranteed to be positive
-K = (uint64_t) PosVal; 
+  // Handle the difference between USAT (unsigned) and SSAT (signed)
+  // saturation
+  // At this point, PosVal is guaranteed to be positive
+  uint64_t K = PosVal;
+  SDLoc dl(Op);
+  if (Val1 == ~Val2)
+return DAG.getNode(ARMISD::SSAT, dl, VT, V2Tmp,
+   DAG.getConstant(countTrailingOnes(K), dl, VT));
+  if (NegVal == 0)
+return DAG.getNode(ARMISD::USAT, dl, VT, V2Tmp,
+   DAG.getConstant(countTrailingOnes(K), dl, VT));
 
-return true;
-  }
-  return false;
+  return SDValue();
 }
 
 // Check if a condition of the type x < k ? k : x can be converted into a
@@ -5168,18 +5162,9 @@ SDValue ARMTargetLowering::LowerSELECT_CC(SDValue 

[llvm-branch-commits] [llvm] 39db575 - [LV][ARM] Inloop reduction cost modelling

2021-01-21 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-21T21:03:41Z
New Revision: 39db5753f993abcc4289dd165e8297a4e28f4b0a

URL: 
https://github.com/llvm/llvm-project/commit/39db5753f993abcc4289dd165e8297a4e28f4b0a
DIFF: 
https://github.com/llvm/llvm-project/commit/39db5753f993abcc4289dd165e8297a4e28f4b0a.diff

LOG: [LV][ARM] Inloop reduction cost modelling

This adds cost modelling for the in-loop vectorization added in
745bf6cf4471. Up until now these reductions have been modelled as the
original underlying instruction, usually an add. That happens to work OK
for MVE with instructions that reduce into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  ->
  R = VMLADAV A, B

There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.

Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.

This patch attempts to improve the cost modelling of in-loop reductions
by:
 - Adding some pattern matching in the loop vectorizer cost model to
   match extended reduction patterns that are optionally extended and/or
   MLA patterns. This marks the cost of the reduction instruction correctly
   and the sext/zext/mul leading up to it as free, which is otherwise
   difficult to tell and may get a very high cost. (In the long run this
   can hopefully be replaced by vplan producing a single node and costing
   it correctly, but that is not yet something that vplan can do).
 - getExtendedAddReductionCost is added to query the cost of these
   extended reduction patterns.
 - Expanded the ARM costs to account for these expanded sizes, which is a
   fairly simple change in itself.
 - Some minor alterations to allow in-loop reductions larger than the highest
   vector width, and i64 MVE reductions.
 - An extra InLoopReductionImmediateChains map was added to the vectorizer
   for it to efficiently detect which instructions are reductions in the
   cost model.
 - The tests have some updates to show what I believe is optimal
   vectorization and where we are now.

Put together this can greatly improve performance for reduction loop
under MVE.

Differential Revision: https://reviews.llvm.org/D93476
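
As a quick illustration of the new hook (hypothetical client code matching the declaration added below; Ctx and TTI are assumed to be in scope), querying the cost of the VMLADAV-style pattern i32 vecreduce.add(mul(sext(<16 x i8>), sext(<16 x i8>))):
```
InstructionCost Cost = TTI.getExtendedAddReductionCost(
    /*IsMLA=*/true, /*IsUnsigned=*/false,
    /*ResTy=*/Type::getInt32Ty(Ctx),
    /*Ty=*/FixedVectorType::get(Type::getInt8Ty(Ctx), 16));
```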

Added: 


Modified: 
llvm/include/llvm/Analysis/TargetTransformInfo.h
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
llvm/include/llvm/CodeGen/BasicTTIImpl.h
llvm/lib/Analysis/TargetTransformInfo.cpp
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
llvm/lib/Target/ARM/ARMTargetTransformInfo.h
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll
llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll

Removed: 




diff  --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h 
b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index ee34312ccf6d..040450bd9f27 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1182,6 +1182,16 @@ class TargetTransformInfo {
 VectorType *Ty, VectorType *CondTy, bool IsPairwiseForm, bool IsUnsigned,
 TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;
 
+  /// Calculate the cost of an extended reduction pattern, similar to
+  /// getArithmeticReductionCost of an Add reduction with an extension and
+  /// optional multiply. This is the cost of as:
+  /// ResTy vecreduce.add(ext(Ty A)), or if IsMLA flag is set then:
+  /// ResTy vecreduce.add(mul(ext(Ty A), ext(Ty B)). The reduction happens
+  /// on a VectorType with ResTy elements and Ty lanes.
+  InstructionCost getExtendedAddReductionCost(
+  bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;
+
   /// \returns 

[llvm-branch-commits] [llvm] dfac521 - [ARM] Fix vector saddsat costs.

2021-01-21 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-21T15:30:39Z
New Revision: dfac521da1b90db6832a0d357f67cb819ea8687f

URL: 
https://github.com/llvm/llvm-project/commit/dfac521da1b90db6832a0d357f67cb819ea8687f
DIFF: 
https://github.com/llvm/llvm-project/commit/dfac521da1b90db6832a0d357f67cb819ea8687f.diff

LOG: [ARM] Fix vector saddsat costs.

It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and a vector VF. This updates the cost model to
handle that, still producing the correct vector costs.

A vectorizer test is added to show it vectorizing at the correct factor
again.
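
Restated as code (mirroring the fix in the diff below): when the return type is scalar but the VF is a vector factor, the cost should be computed on the corresponding vector type:
```
// Derive the type the cost applies to from the scalar return type + VF.
Type *VT = ICA.getReturnType();
if (!VT->isVectorTy() && !ICA.getVectorFactor().isScalar())
  VT = VectorType::get(VT, ICA.getVectorFactor()); // e.g. i16, VF=8 -> <8 x i16>
```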

Added: 
llvm/test/Transforms/LoopVectorize/ARM/mve-saddsatcost.ll

Modified: 
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp 
b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index a94d35118051..46c5ba12e82a 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -1531,8 +1531,13 @@ int ARMTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
   case Intrinsic::usub_sat: {
 if (!ST->hasMVEIntegerOps())
   break;
+// Get the Return type, either directly of from ICA.ReturnType and ICA.VF.
+Type *VT = ICA.getReturnType();
+if (!VT->isVectorTy() && !ICA.getVectorFactor().isScalar())
+  VT = VectorType::get(VT, ICA.getVectorFactor());
+
 std::pair<int, MVT> LT =
-TLI->getTypeLegalizationCost(DL, ICA.getReturnType());
+TLI->getTypeLegalizationCost(DL, VT);
 if (LT.second == MVT::v4i32 || LT.second == MVT::v8i16 ||
 LT.second == MVT::v16i8) {
   // This is a base cost of 1 for the vadd, plus 3 extract shifts if we

diff  --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-saddsatcost.ll 
b/llvm/test/Transforms/LoopVectorize/ARM/mve-saddsatcost.ll
new file mode 100644
index ..35a3153a58db
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/ARM/mve-saddsatcost.ll
@@ -0,0 +1,57 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -loop-vectorize -instcombine -simplifycfg < %s -S -o - | FileCheck 
%s --check-prefix=CHECK
+
+target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
+target triple = "thumbv8.1m.main-arm-none-eabi"
+
+define void @arm_offset_q15(i16* nocapture readonly %pSrc, i16 signext 
%offset, i16* nocapture noalias %pDst, i32 %blockSize) #0 {
+; CHECK-LABEL: @arm_offset_q15(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:[[CMP_NOT6:%.*]] = icmp eq i32 [[BLOCKSIZE:%.*]], 0
+; CHECK-NEXT:br i1 [[CMP_NOT6]], label [[WHILE_END:%.*]], label 
[[VECTOR_PH:%.*]]
+; CHECK:   vector.ph:
+; CHECK-NEXT:[[N_RND_UP:%.*]] = add i32 [[BLOCKSIZE]], 7
+; CHECK-NEXT:[[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
+; CHECK-NEXT:[[BROADCAST_SPLATINSERT8:%.*]] = insertelement <8 x i16> 
poison, i16 [[OFFSET:%.*]], i32 0
+; CHECK-NEXT:[[BROADCAST_SPLAT9:%.*]] = shufflevector <8 x i16> 
[[BROADCAST_SPLATINSERT8]], <8 x i16> poison, <8 x i32> zeroinitializer
+; CHECK-NEXT:br label [[VECTOR_BODY:%.*]]
+; CHECK:   vector.body:
+; CHECK-NEXT:[[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ 
[[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:[[NEXT_GEP:%.*]] = getelementptr i16, i16* [[PSRC:%.*]], i32 
[[INDEX]]
+; CHECK-NEXT:[[NEXT_GEP5:%.*]] = getelementptr i16, i16* [[PDST:%.*]], i32 
[[INDEX]]
+; CHECK-NEXT:[[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> 
@llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[BLOCKSIZE]])
+; CHECK-NEXT:[[TMP0:%.*]] = bitcast i16* [[NEXT_GEP]] to <8 x i16>*
+; CHECK-NEXT:[[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> 
@llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP0]], i32 2, <8 x i1> 
[[ACTIVE_LANE_MASK]], <8 x i16> poison)
+; CHECK-NEXT:[[TMP1:%.*]] = call <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16> 
[[WIDE_MASKED_LOAD]], <8 x i16> [[BROADCAST_SPLAT9]])
+; CHECK-NEXT:[[TMP2:%.*]] = bitcast i16* [[NEXT_GEP5]] to <8 x i16>*
+; CHECK-NEXT:call void @llvm.masked.store.v8i16.p0v8i16(<8 x i16> 
[[TMP1]], <8 x i16>* [[TMP2]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]])
+; CHECK-NEXT:[[INDEX_NEXT]] = add i32 [[INDEX]], 8
+; CHECK-NEXT:[[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:br i1 [[TMP3]], label [[WHILE_END]], label [[VECTOR_BODY]], 
[[LOOP0:!llvm.loop !.*]]
+; CHECK:   while.end:
+; CHECK-NEXT:ret void
+;
+entry:
+  %cmp.not6 = icmp eq i32 %blockSize, 0
+  br i1 %cmp.not6, label %while.end, label %while.body
+
+while.body:   ; preds = %entry, %while.body
+  %blkCnt.09 = phi i32 [ %dec, %while.body ], [ %blockSize, %entry ]
+  %pSrc.addr.08 = phi i16* [ %incdec.ptr, %while.body ], [ %pSrc, %entry ]
+  %pDst.addr.07 = phi i16* [ %incdec.ptr3, %while.body ], [ %pDst, %entry ]
+  %incdec.ptr = getelementptr inbounds i16, i16* 

[llvm-branch-commits] [llvm] 045d84f - D94954: Fixes Snapdragon Kryo CPU core detection

2021-01-20 Thread David Green via llvm-branch-commits

Author: Ryan Houdek
Date: 2021-01-20T22:23:43Z
New Revision: 045d84f4e6d7d6bbccaa6d965669a068fc329809

URL: 
https://github.com/llvm/llvm-project/commit/045d84f4e6d7d6bbccaa6d965669a068fc329809
DIFF: 
https://github.com/llvm/llvm-project/commit/045d84f4e6d7d6bbccaa6d965669a068fc329809.diff

LOG: D94954: Fixes Snapdragon Kryo CPU core detection

All of these families were claiming to be a73 based, which was causing
-mcpu/mtune=native to never use the newer features available to these
cores.

Goes through each and bumps the individual cores to their respective Big
counterparts. Since this code path doesn't support big.little detection,
there was already a precedent set with the Qualcomm line to choose the
big cores only.

Adds a comment on each line for the product name that the part number
refers to. Confirmed on-device and through Linux header naming
conventions.

Additionally, newer SoCs mix CPU parts from multiple implementers:
both 0x41 (ARM) and 0x51 (Qualcomm) in the Snapdragon case.

This was causing a desync where the initial scan for the implementer
could mismatch the later scan for the part.
Now both implementer and part are scanned at the start so the two stay
in sync.

Differential Revision: https://reviews.llvm.org/D94954
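
For illustration, a hypothetical /proc/cpuinfo fragment of the kind described (constructed for this note; the part numbers follow the tables in Host.cpp, and the Kryo mapping shown is an assumption):
```
// A little core reported as ARM and a big core reported as Qualcomm.
StringRef CpuInfo = "CPU implementer : 0x41\n"   // ARM Ltd.
                    "CPU part        : 0xd05\n"  // Cortex-A55 (little)
                    "CPU implementer : 0x51\n"   // Qualcomm
                    "CPU part        : 0x804\n"; // Kryo 4XX Gold (big, assumed)
// With implementer and part scanned in the same pass, the lookup uses a
// matching implementer/part pair instead of mixing the two tables.
StringRef Name = sys::detail::getHostCPUNameForARM(CpuInfo);
```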

Added: 


Modified: 
llvm/lib/Support/Host.cpp
llvm/unittests/Support/Host.cpp

Removed: 




diff  --git a/llvm/lib/Support/Host.cpp b/llvm/lib/Support/Host.cpp
index ea561abb28878..a1bd3cc12f1d1 100644
--- a/llvm/lib/Support/Host.cpp
+++ b/llvm/lib/Support/Host.cpp
@@ -161,11 +161,14 @@ StringRef sys::detail::getHostCPUNameForARM(StringRef 
ProcCpuinfoContent) {
   // Look for the CPU implementer line.
   StringRef Implementer;
   StringRef Hardware;
+  StringRef Part;
   for (unsigned I = 0, E = Lines.size(); I != E; ++I) {
 if (Lines[I].startswith("CPU implementer"))
   Implementer = Lines[I].substr(15).ltrim("\t :");
 if (Lines[I].startswith("Hardware"))
   Hardware = Lines[I].substr(8).ltrim("\t :");
+if (Lines[I].startswith("CPU part"))
+  Part = Lines[I].substr(8).ltrim("\t :");
   }
 
   if (Implementer == "0x41") { // ARM Ltd.
@@ -175,111 +178,89 @@ StringRef sys::detail::getHostCPUNameForARM(StringRef 
ProcCpuinfoContent) {
   return "cortex-a53";
 
 
-// Look for the CPU part line.
-for (unsigned I = 0, E = Lines.size(); I != E; ++I)
-  if (Lines[I].startswith("CPU part"))
-// The CPU part is a 3 digit hexadecimal number with a 0x prefix. The
-// values correspond to the "Part number" in the CP15/c0 register. The
-// contents are specified in the various processor manuals.
-// This corresponds to the Main ID Register in Technical Reference 
Manuals.
-// and is used in programs like sys-utils
-return StringSwitch<const char *>(Lines[I].substr(8).ltrim("\t :"))
-.Case("0x926", "arm926ej-s")
-.Case("0xb02", "mpcore")
-.Case("0xb36", "arm1136j-s")
-.Case("0xb56", "arm1156t2-s")
-.Case("0xb76", "arm1176jz-s")
-.Case("0xc08", "cortex-a8")
-.Case("0xc09", "cortex-a9")
-.Case("0xc0f", "cortex-a15")
-.Case("0xc20", "cortex-m0")
-.Case("0xc23", "cortex-m3")
-.Case("0xc24", "cortex-m4")
-.Case("0xd22", "cortex-m55")
-.Case("0xd02", "cortex-a34")
-.Case("0xd04", "cortex-a35")
-.Case("0xd03", "cortex-a53")
-.Case("0xd07", "cortex-a57")
-.Case("0xd08", "cortex-a72")
-.Case("0xd09", "cortex-a73")
-.Case("0xd0a", "cortex-a75")
-.Case("0xd0b", "cortex-a76")
-.Case("0xd0d", "cortex-a77")
-.Case("0xd41", "cortex-a78")
-.Case("0xd44", "cortex-x1")
-.Case("0xd0c", "neoverse-n1")
-.Case("0xd49", "neoverse-n2")
-.Default("generic");
+// The CPU part is a 3 digit hexadecimal number with a 0x prefix. The
+// values correspond to the "Part number" in the CP15/c0 register. The
+// contents are specified in the various processor manuals.
+// This corresponds to the Main ID Register in Technical Reference Manuals.
+// and is used in programs like sys-utils
+return StringSwitch<const char *>(Part)
+.Case("0x926", "arm926ej-s")
+.Case("0xb02", "mpcore")
+.Case("0xb36", "arm1136j-s")
+.Case("0xb56", "arm1156t2-s")
+.Case("0xb76", "arm1176jz-s")
+.Case("0xc08", "cortex-a8")
+.Case("0xc09", "cortex-a9")
+.Case("0xc0f", "cortex-a15")
+.Case("0xc20", "cortex-m0")
+.Case("0xc23", "cortex-m3")
+.Case("0xc24", "cortex-m4")
+.Case("0xd22", "cortex-m55")
+.Case("0xd02", "cortex-a34")
+.Case("0xd04", "cortex-a35")
+.Case("0xd03", "cortex-a53")
+.Case("0xd07", 

[llvm-branch-commits] [llvm] 6a563ee - [ARM] Expand vXi1 VSELECT's

2021-01-19 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-19T17:56:50Z
New Revision: 6a563eef1321f742fa06482f4536cd41fb8e24c7

URL: 
https://github.com/llvm/llvm-project/commit/6a563eef1321f742fa06482f4536cd41fb8e24c7
DIFF: 
https://github.com/llvm/llvm-project/commit/6a563eef1321f742fa06482f4536cd41fb8e24c7.diff

LOG: [ARM] Expand vXi1 VSELECT's

We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.

Differential Revision: https://reviews.llvm.org/D94946
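
For reference, the expansion is just bitwise logic; a one-lane scalar model (constructed for this note):
```
// On i1 elements, select(c, a, b) expands to (a & c) | (b & ~c).
static bool selectLane(bool Cond, bool A, bool B) {
  return (A && Cond) || (B && !Cond);
}
```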

Added: 
llvm/test/CodeGen/Thumb2/mve-pred-vselect.ll

Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
llvm/test/Analysis/CostModel/ARM/arith-ssat.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 46c5efa2cf2f8..aabfad045d9fa 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -444,6 +444,8 @@ void ARMTargetLowering::addMVEVectorTypes(bool HasMVEFP) {
 setOperationAction(ISD::LOAD, VT, Custom);
 setOperationAction(ISD::STORE, VT, Custom);
 setOperationAction(ISD::TRUNCATE, VT, Custom);
+setOperationAction(ISD::VSELECT, VT, Expand);
+setOperationAction(ISD::SELECT, VT, Expand);
   }
 }
 

diff  --git a/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll 
b/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
index 172df86003569..0a29083f27f5c 100644
--- a/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
+++ b/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
@@ -68,20 +68,20 @@ define i32 @sadd(i32 %arg) {
 ; MVE-RECIP-LABEL: 'sadd'
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: 
%I64 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 undef, i64 undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 74 for instruction: 
%V2I64 = call { <2 x i64>, <2 x i1> } @llvm.sadd.with.overflow.v2i64(<2 x i64> 
undef, <2 x i64> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 166 for instruction: 
%V4I64 = call { <4 x i64>, <4 x i1> } @llvm.sadd.with.overflow.v4i64(<4 x i64> 
undef, <4 x i64> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 582 for instruction: 
%V8I64 = call { <8 x i64>, <8 x i1> } @llvm.sadd.with.overflow.v8i64(<8 x i64> 
undef, <8 x i64> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 242 for instruction: 
%V4I64 = call { <4 x i64>, <4 x i1> } @llvm.sadd.with.overflow.v4i64(<4 x i64> 
undef, <4 x i64> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 866 for instruction: 
%V8I64 = call { <8 x i64>, <8 x i1> } @llvm.sadd.with.overflow.v8i64(<8 x i64> 
undef, <8 x i64> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: 
%I32 = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 undef, i32 undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: 
%V4I32 = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> 
undef, <4 x i32> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: 
%V8I32 = call { <8 x i32>, <8 x i1> } @llvm.sadd.with.overflow.v8i32(<8 x i32> 
undef, <8 x i32> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: 
%V16I32 = call { <16 x i32>, <16 x i1> } @llvm.sadd.with.overflow.v16i32(<16 x 
i32> undef, <16 x i32> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 90 for instruction: 
%V4I32 = call { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32> 
undef, <4 x i32> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 306 for instruction: 
%V8I32 = call { <8 x i32>, <8 x i1> } @llvm.sadd.with.overflow.v8i32(<8 x i32> 
undef, <8 x i32> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 1122 for 
instruction: %V16I32 = call { <16 x i32>, <16 x i1> } 
@llvm.sadd.with.overflow.v16i32(<16 x i32> undef, <16 x i32> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: 
%I16 = call { i16, i1 } @llvm.sadd.with.overflow.i16(i16 undef, i16 undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: 
%V8I16 = call { <8 x i16>, <8 x i1> } @llvm.sadd.with.overflow.v8i16(<8 x i16> 
undef, <8 x i16> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 22 for instruction: 
%V16I16 = call { <16 x i16>, <16 x i1> } @llvm.sadd.with.overflow.v16i16(<16 x 
i16> undef, <16 x i16> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 44 for instruction: 
%V32I16 = call { <32 x i16>, <32 x i1> } @llvm.sadd.with.overflow.v32i16(<32 x 
i16> undef, <32 x i16> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 298 for instruction: 
%V8I16 = call { <8 x i16>, <8 x i1> } @llvm.sadd.with.overflow.v8i16(<8 x i16> 
undef, <8 x i16> 

[llvm-branch-commits] [llvm] f373b30 - [ARM] Add MVE add.sat costs

2021-01-19 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-19T15:38:46Z
New Revision: f373b30923d7a83985e59ec76a566dd889e684d9

URL: 
https://github.com/llvm/llvm-project/commit/f373b30923d7a83985e59ec76a566dd889e684d9
DIFF: 
https://github.com/llvm/llvm-project/commit/f373b30923d7a83985e59ec76a566dd889e684d9.diff

LOG: [ARM] Add MVE add.sat costs

This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs,
based on when the instruction is legal. With smaller-than-legal types
that are promoted, we generate shr(qadd(shl, shl)), so a cost of 4 is
appropriate.

Differential Revision: https://reviews.llvm.org/D94958
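
A standalone scalar model of the promoted expansion (constructed for this note, assuming two's complement), showing where the four operations, and hence the cost of 4, come from:
```
#include <cstdint>
#include <limits>

// Stand-in for the 32-bit saturating add (QADD) instruction.
static int32_t qadd32(int32_t A, int32_t B) {
  int64_t S = int64_t(A) + int64_t(B);
  if (S > std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();
  if (S < std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(S);
}

// i16 sadd.sat promoted to i32: two shls, one qadd, one shr -- four ops.
static int16_t saddSat16(int16_t A, int16_t B) {
  int32_t HiA = static_cast<int32_t>(uint32_t(int32_t(A)) << 16);
  int32_t HiB = static_cast<int32_t>(uint32_t(int32_t(B)) << 16);
  return static_cast<int16_t>(qadd32(HiA, HiB) >> 16);
}
```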

Added: 


Modified: 
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
llvm/test/Analysis/CostModel/ARM/arith-usat.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp 
b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index a75c771e66be..a94d35118051 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -1513,15 +1513,39 @@ int ARMTTIImpl::getArithmeticReductionCost(unsigned 
Opcode, VectorType *ValTy,
 
 int ARMTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
   TTI::TargetCostKind CostKind) {
-  // Currently we make a somewhat optimistic assumption that active_lane_mask's
-  // are always free. In reality it may be freely folded into a tail predicated
-  // loop, expanded into a VCPT or expanded into a lot of add/icmp code. We
-  // may need to improve this in the future, but being able to detect if it
-  // is free or not involves looking at a lot of other code. We currently 
assume
-  // that the vectorizer inserted these, and knew what it was doing in adding
-  // one.
-  if (ST->hasMVEIntegerOps() && ICA.getID() == Intrinsic::get_active_lane_mask)
-return 0;
+  switch (ICA.getID()) {
+  case Intrinsic::get_active_lane_mask:
+// Currently we make a somewhat optimistic assumption that
+// active_lane_mask's are always free. In reality it may be freely folded
+// into a tail predicated loop, expanded into a VCPT or expanded into a lot
+// of add/icmp code. We may need to improve this in the future, but being
+// able to detect if it is free or not involves looking at a lot of other
+// code. We currently assume that the vectorizer inserted these, and knew
+// what it was doing in adding one.
+if (ST->hasMVEIntegerOps())
+  return 0;
+break;
+  case Intrinsic::sadd_sat:
+  case Intrinsic::ssub_sat:
+  case Intrinsic::uadd_sat:
+  case Intrinsic::usub_sat: {
+if (!ST->hasMVEIntegerOps())
+  break;
+std::pair<int, MVT> LT =
+TLI->getTypeLegalizationCost(DL, ICA.getReturnType());
+if (LT.second == MVT::v4i32 || LT.second == MVT::v8i16 ||
+LT.second == MVT::v16i8) {
+  // This is a base cost of 1 for the vadd, plus 3 extract shifts if we
+  // need to extend the type, as it uses shr(qadd(shl, shl)).
+  unsigned Instrs = LT.second.getScalarSizeInBits() ==
+ICA.getReturnType()->getScalarSizeInBits()
+? 1
+: 4;
+  return LT.first * ST->getMVEVectorCostFactor() * Instrs;
+}
+break;
+  }
+  }
 
   return BaseT::getIntrinsicInstrCost(ICA, CostKind);
 }

diff  --git a/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll 
b/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
index 13c1a148b249..bc8b23bc001f 100644
--- a/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
+++ b/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
@@ -90,22 +90,22 @@ define i32 @add(i32 %arg) {
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 1046 for 
instruction: %V8I64 = call <8 x i64> @llvm.sadd.sat.v8i64(<8 x i64> undef, <8 x 
i64> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: 
%I32 = call i32 @llvm.sadd.sat.i32(i32 undef, i32 undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 108 for instruction: 
%V2I32 = call <2 x i32> @llvm.sadd.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: 
%V4I32 = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: 
%V8I32 = call <8 x i32> @llvm.sadd.sat.v8i32(<8 x i32> undef, <8 x i32> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 62 for instruction: 
%V16I32 = call <16 x i32> @llvm.sadd.sat.v16i32(<16 x i32> undef, <16 x i32> 
undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: 
%V4I32 = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: 
%V8I32 = call <8 x i32> @llvm.sadd.sat.v8i32(<8 x i32> undef, 

[llvm-branch-commits] [llvm] 54e3844 - [ARM] Expand add.sat/sub.sat cost checks. NFC

2021-01-19 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-19T15:06:06Z
New Revision: 54e38440e74f98ec58a22d7d8f9fc5e550ce65aa

URL: 
https://github.com/llvm/llvm-project/commit/54e38440e74f98ec58a22d7d8f9fc5e550ce65aa
DIFF: 
https://github.com/llvm/llvm-project/commit/54e38440e74f98ec58a22d7d8f9fc5e550ce65aa.diff

LOG: [ARM] Expand add.sat/sub.sat cost checks. NFC

Added: 


Modified: 
llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
llvm/test/Analysis/CostModel/ARM/arith-usat.ll

Removed: 




diff  --git a/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll 
b/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
index 66c99d804b26..13c1a148b249 100644
--- a/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
+++ b/llvm/test/Analysis/CostModel/ARM/arith-ssat.ll
@@ -12,16 +12,22 @@ declare <4 x i64>  @llvm.sadd.sat.v4i64(<4 x i64>, <4 x 
i64>)
 declare <8 x i64>  @llvm.sadd.sat.v8i64(<8 x i64>, <8 x i64>)
 
 declare i32@llvm.sadd.sat.i32(i32, i32)
+declare <2 x i32>  @llvm.sadd.sat.v2i32(<2 x i32>, <2 x i32>)
 declare <4 x i32>  @llvm.sadd.sat.v4i32(<4 x i32>, <4 x i32>)
 declare <8 x i32>  @llvm.sadd.sat.v8i32(<8 x i32>, <8 x i32>)
 declare <16 x i32> @llvm.sadd.sat.v16i32(<16 x i32>, <16 x i32>)
 
 declare i16@llvm.sadd.sat.i16(i16, i16)
+declare <2 x i16>  @llvm.sadd.sat.v2i16(<2 x i16>, <2 x i16>)
+declare <4 x i16>  @llvm.sadd.sat.v4i16(<4 x i16>, <4 x i16>)
 declare <8 x i16>  @llvm.sadd.sat.v8i16(<8 x i16>, <8 x i16>)
 declare <16 x i16> @llvm.sadd.sat.v16i16(<16 x i16>, <16 x i16>)
 declare <32 x i16> @llvm.sadd.sat.v32i16(<32 x i16>, <32 x i16>)
 
 declare i8 @llvm.sadd.sat.i8(i8,  i8)
+declare <2 x i8>   @llvm.sadd.sat.v2i8(<2 x i8>, <2 x i8>)
+declare <4 x i8>   @llvm.sadd.sat.v4i8(<4 x i8>, <4 x i8>)
+declare <8 x i8>   @llvm.sadd.sat.v8i8(<8 x i8>, <8 x i8>)
 declare <16 x i8>  @llvm.sadd.sat.v16i8(<16 x i8>, <16 x i8>)
 declare <32 x i8>  @llvm.sadd.sat.v32i8(<32 x i8>, <32 x i8>)
 declare <64 x i8>  @llvm.sadd.sat.v64i8(<64 x i8>, <64 x i8>)
@@ -33,14 +39,20 @@ define i32 @add(i32 %arg) {
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 108 for instruction: 
%V4I64 = call <4 x i64> @llvm.sadd.sat.v4i64(<4 x i64> undef, <4 x i64> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 216 for instruction: 
%V8I64 = call <8 x i64> @llvm.sadd.sat.v8i64(<8 x i64> undef, <8 x i64> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: 
%I32 = call i32 @llvm.sadd.sat.i32(i32 undef, i32 undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: 
%V2I32 = call <2 x i32> @llvm.sadd.sat.v2i32(<2 x i32> undef, <2 x i32> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 72 for instruction: 
%V4I32 = call <4 x i32> @llvm.sadd.sat.v4i32(<4 x i32> undef, <4 x i32> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 144 for instruction: 
%V8I32 = call <8 x i32> @llvm.sadd.sat.v8i32(<8 x i32> undef, <8 x i32> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 288 for instruction: 
%V16I32 = call <16 x i32> @llvm.sadd.sat.v16i32(<16 x i32> undef, <16 x i32> 
undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: 
%I16 = call i16 @llvm.sadd.sat.i16(i16 undef, i16 undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: 
%V2I16 = call <2 x i16> @llvm.sadd.sat.v2i16(<2 x i16> undef, <2 x i16> undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 72 for instruction: 
%V4I16 = call <4 x i16> @llvm.sadd.sat.v4i16(<4 x i16> undef, <4 x i16> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 144 for instruction: 
%V8I16 = call <8 x i16> @llvm.sadd.sat.v8i16(<8 x i16> undef, <8 x i16> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 288 for instruction: 
%V16I16 = call <16 x i16> @llvm.sadd.sat.v16i16(<16 x i16> undef, <16 x i16> 
undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 576 for instruction: 
%V32I16 = call <32 x i16> @llvm.sadd.sat.v32i16(<32 x i16> undef, <32 x i16> 
undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: 
%I8 = call i8 @llvm.sadd.sat.i8(i8 undef, i8 undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: 
%V2I8 = call <2 x i8> @llvm.sadd.sat.v2i8(<2 x i8> undef, <2 x i8> undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 72 for instruction: 
%V4I8 = call <4 x i8> @llvm.sadd.sat.v4i8(<4 x i8> undef, <4 x i8> undef)
+; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 144 for instruction: 
%V8I8 = call <8 x i8> @llvm.sadd.sat.v8i8(<8 x i8> undef, <8 x i8> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 288 for instruction: 
%V16I8 = call <16 x i8> @llvm.sadd.sat.v16i8(<16 x i8> undef, <16 x i8> undef)
 ; V8M-RECIP-NEXT:  Cost Model: Found an estimated cost of 576 for 

[llvm-branch-commits] [llvm] e7dc083 - [ARM] Don't handle low overhead branches in AnalyzeBranch

2021-01-18 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-18T17:16:07Z
New Revision: e7dc083a410f187e143138b4956993370626268b

URL: 
https://github.com/llvm/llvm-project/commit/e7dc083a410f187e143138b4956993370626268b
DIFF: 
https://github.com/llvm/llvm-project/commit/e7dc083a410f187e143138b4956993370626268b.diff

LOG: [ARM] Don't handle low overhead branches in AnalyzeBranch

It turns out that the BranchFolder and IfCvt do not like unanalyzable
branches that fall through. This means that removing the unconditional
branches from the end of tail predicated instructions can run into
asserts and verifier issues.

This effectively reverts 372eb2bbb6fb903ce76266e659dfefbaee67722b, but
adds handling for t2DoLoopStartTP, which is not a branch and so can be
safely skipped.

Added: 
llvm/test/CodeGen/Thumb2/mve-blockplacement.ll

Modified: 
llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/sibling-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vcmp-vpst-combination.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll
llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/Thumb2/mve-gather-increment.ll
llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll
llvm/test/CodeGen/Thumb2/mve-gather-tailpred.ll
llvm/test/CodeGen/Thumb2/mve-satmul-loops.ll
llvm/test/CodeGen/Thumb2/mve-scatter-increment.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
llvm/test/CodeGen/Thumb2/mve-vldshuffle.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
index 54586e0c256b..143bf6641e6f 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
@@ -367,15 +367,15 @@ bool ARMBaseInstrInfo::analyzeBranch(MachineBasicBlock &MBB,
 // Skip over DEBUG values, predicated nonterminators and speculation
 // barrier terminators.
 while (I->isDebugInstr() || !I->isTerminator() ||
-   isSpeculationBarrierEndBBOpcode(I->getOpcode()) ){
+   isSpeculationBarrierEndBBOpcode(I->getOpcode()) ||
+   I->getOpcode() == ARM::t2DoLoopStartTP){
   if (I == MBB.instr_begin())
 return false;
   --I;
 }
 
 if (isIndirectBranchOpcode(I->getOpcode()) ||
-isJumpTableBranchOpcode(I->getOpcode()) ||
-isLowOverheadTerminatorOpcode(I->getOpcode())) {
+isJumpTableBranchOpcode(I->getOpcode())) {
   // Indirect branches and jump tables can't be analyzed, but we still want
   // to clean up any instructions at the tail of the basic block.
   CantAnalyze = true;

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
index ec574ad827a4..fec6ff7c2154 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
@@ -330,9 +330,9 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vdup.32 q1, r12
 ; CHECK-NEXT:vdup.32 q2, r12
 ; CHECK-NEXT:vstrw.32 q0, [sp] @ 16-byte Spill
-; CHECK-NEXT:b .LBB2_5
+; CHECK-NEXT:b .LBB2_4
 ; CHECK-NEXT:  .LBB2_2: @ %cond.load25
-; CHECK-NEXT:@ in Loop: Header=BB2_5 Depth=1
+; CHECK-NEXT:@ in Loop: Header=BB2_4 Depth=1
 ; CHECK-NEXT:vmovx.f16 s0, s28
 ; CHECK-NEXT:vmov r4, s28
 ; CHECK-NEXT:vmov r2, s0
@@ -344,7 +344,7 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vmov r2, s0
 ; CHECK-NEXT:vmov.16 q6[3], r2
 ; CHECK-NEXT:  .LBB2_3: @ %else26
-; CHECK-NEXT:@ in Loop: Header=BB2_5 Depth=1
+; CHECK-NEXT:@ in Loop: Header=BB2_4 Depth=1
 ; CHECK-NEXT:vmul.f16 q0, q6, q5
 ; CHECK-NEXT:adds r0, #8
 ; CHECK-NEXT:vcvtt.f32.f16 s23, s1
@@ -355,18 +355,9 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vcvtb.f32.f16 s20, s0
 ; CHECK-NEXT:vadd.f32 q5, q3, q5
 ; CHECK-NEXT:subs.w lr, lr, #1
-; CHECK-NEXT:bne .LBB2_5
-; CHECK-NEXT:  @ %bb.4: @ %middle.block
-; CHECK-NEXT:vdup.32 q0, r12
-; CHECK-NEXT:vcmp.u32 cs, q0, q4
-; CHECK-NEXT:vpsel q0, q5, q3
-; CHECK-NEXT:vmov.f32 s4, s2
-; CHECK-NEXT:vmov.f32 s5, s3
-; CHECK-NEXT:vadd.f32 q0, q0, q1
-; CHECK-NEXT:vmov r0, s1
-; CHECK-NEXT:vadd.f32 q0, q0, r0
-; CHECK-NEXT:b .LBB2_23
-; CHECK-NEXT:  .LBB2_5: @ %vector.body
+; CHECK-NEXT:bne .LBB2_4
+; CHECK-NEXT:b .LBB2_21
+; CHECK-NEXT:  .LBB2_4: @ %vector.body
 ; CHECK-NEXT: 

[llvm-branch-commits] [llvm] 6929581 - [ARM] Update test target triple. NFC

2021-01-18 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-18T16:36:00Z
New Revision: 69295815ed92cc125f7ae0a0c41c99fd507dad9d

URL: 
https://github.com/llvm/llvm-project/commit/69295815ed92cc125f7ae0a0c41c99fd507dad9d
DIFF: 
https://github.com/llvm/llvm-project/commit/69295815ed92cc125f7ae0a0c41c99fd507dad9d.diff

LOG: [ARM] Update test target triple. NFC

Added: 


Modified: 
llvm/test/CodeGen/ARM/ParallelDSP/aliasing.ll
llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad0.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad1.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad10.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad11.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad12.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad2.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad3.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad4.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad5.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad8.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlad9.ll
llvm/test/CodeGen/ARM/ParallelDSP/smladx-1.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlald0.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlald1.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlald2.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlaldx-1.ll
llvm/test/CodeGen/ARM/ParallelDSP/smlaldx-2.ll

Removed: 




diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/aliasing.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/aliasing.ll
index 4edf5bfbbef0..6147b58f0c9e 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/aliasing.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/aliasing.ll
@@ -1,4 +1,4 @@
-; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp 
-verify -S | FileCheck %s
+; RUN: opt -mtriple=arm-none-none-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp 
-verify -S | FileCheck %s
 ;
 ; Alias check: check that the rewrite isn't triggered when there's a store
 ; instruction possibly aliasing any mul load operands; arguments are passed

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
index ea7a83b96b46..4a5e0cba8634 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -O3 -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s | FileCheck %s 
--check-prefixes=CHECK-LE
-; RUN: llc -O3 -mtriple=armeb-arm-eabi -mcpu=cortex-m33 < %s | FileCheck %s 
--check-prefixes=CHECK-BE
+; RUN: llc -O3 -mtriple=arm-none-none-eabi -mcpu=cortex-m33 < %s | FileCheck 
%s --check-prefixes=CHECK-LE
+; RUN: llc -O3 -mtriple=armeb-none-none-eabi -mcpu=cortex-m33 < %s | FileCheck 
%s --check-prefixes=CHECK-BE
 
 define i32 @add_user(i32 %arg, i32* nocapture readnone %arg1, i16* nocapture 
readonly %arg2, i16* nocapture readonly %arg3) {
 ; CHECK-LE-LABEL: add_user:

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/smlad0.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/smlad0.ll
index 5b3207d85323..63398fd46a70 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/smlad0.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/smlad0.ll
@@ -1,11 +1,11 @@
-; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S | 
FileCheck %s
+; RUN: opt -mtriple=arm-none-none-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp 
-S | FileCheck %s
 ; RUN: opt -mtriple=armeb-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S | 
FileCheck %s --check-prefix=CHECK-UNSUPPORTED
 ;
 ; The Cortex-M0 does not support unaligned accesses:
-; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp -S | 
FileCheck %s --check-prefix=CHECK-UNSUPPORTED
+; RUN: opt -mtriple=arm-none-none-eabi -mcpu=cortex-m0 < %s -arm-parallel-dsp 
-S | FileCheck %s --check-prefix=CHECK-UNSUPPORTED
 ;
 ; Check DSP extension:
-; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 -mattr=-dsp < %s 
-arm-parallel-dsp -S | FileCheck %s --check-prefix=CHECK-UNSUPPORTED
+; RUN: opt -mtriple=arm-none-none-eabi -mcpu=cortex-m33 -mattr=-dsp < %s 
-arm-parallel-dsp -S | FileCheck %s --check-prefix=CHECK-UNSUPPORTED
 
 define dso_local i32 @OneReduction(i32 %arg, i32* nocapture readnone %arg1, 
i16* nocapture readonly %arg2, i16* nocapture readonly %arg3) {
 ;

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/smlad1.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/smlad1.ll
index 6bce049eafb9..cd6ad7dc0f24 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/smlad1.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/smlad1.ll
@@ -1,4 +1,4 @@
-; RUN: opt -mtriple=arm-arm-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp -S | 
FileCheck %s
+; RUN: opt -mtriple=arm-none-none-eabi -mcpu=cortex-m33 < %s -arm-parallel-dsp 
-S | FileCheck %s
 
 ; CHECK-LABEL: @test1
 ; CHECK:  %mac1{{\.}}026 = phi i32 [ [[V8:%[0-9]+]], %for.body ], [ 0, 
%for.body.preheader ]

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/smlad10.ll 

[llvm-branch-commits] [llvm] 2a5b576 - [ARM] Test for aligned blocks. NFC

2021-01-16 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-16T22:04:48Z
New Revision: 2a5b576e3ea41c30537435d989a3dce7a409f8e2

URL: 
https://github.com/llvm/llvm-project/commit/2a5b576e3ea41c30537435d989a3dce7a409f8e2
DIFF: 
https://github.com/llvm/llvm-project/commit/2a5b576e3ea41c30537435d989a3dce7a409f8e2.diff

LOG: [ARM] Test for aligned blocks. NFC

Added: 
llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll

Modified: 


Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll 
b/llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll
new file mode 100644
index ..90bf4df53f30
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll
@@ -0,0 +1,88 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1-m.main-none-eabi -mcpu=cortex-m55 -O3 < %s | 
FileCheck %s
+
+define i32 @loop(i32* nocapture readonly %x) {
+; CHECK-LABEL: loop:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r7, lr}
+; CHECK-NEXT:push {r7, lr}
+; CHECK-NEXT:mov.w lr, #500
+; CHECK-NEXT:dls lr, lr
+; CHECK-NEXT:movs r1, #0
+; CHECK-NEXT:.p2align 2
+; CHECK-NEXT:  .LBB0_1: @ %for.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:ldr r2, [r0], #4
+; CHECK-NEXT:add r1, r2
+; CHECK-NEXT:le lr, .LBB0_1
+; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
+; CHECK-NEXT:mov r0, r1
+; CHECK-NEXT:pop {r7, pc}
+entry:
+  br label %for.body
+
+for.cond.cleanup: ; preds = %for.body
+  ret i32 %add
+
+for.body: ; preds = %entry, %for.body
+  %i.07 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
+  %s.06 = phi i32 [ 0, %entry ], [ %add, %for.body ]
+  %arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.07
+  %0 = load i32, i32* %arrayidx, align 4
+  %add = add nsw i32 %0, %s.06
+  %inc = add nuw nsw i32 %i.07, 1
+  %exitcond.not = icmp eq i32 %inc, 500
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
+define i64 @loopif(i32* nocapture readonly %x, i32 %y, i32 %n) {
+; CHECK-LABEL: loopif:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r7, lr}
+; CHECK-NEXT:push {r7, lr}
+; CHECK-NEXT:cmp r2, #1
+; CHECK-NEXT:blt .LBB1_4
+; CHECK-NEXT:  @ %bb.1: @ %for.body.lr.ph
+; CHECK-NEXT:mov lr, r2
+; CHECK-NEXT:dls lr, r2
+; CHECK-NEXT:mov r12, r0
+; CHECK-NEXT:movs r0, #0
+; CHECK-NEXT:movs r3, #0
+; CHECK-NEXT:.p2align 2
+; CHECK-NEXT:  .LBB1_2: @ %for.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:ldr r2, [r12], #4
+; CHECK-NEXT:smlal r0, r3, r2, r1
+; CHECK-NEXT:le lr, .LBB1_2
+; CHECK-NEXT:  @ %bb.3: @ %for.cond.cleanup
+; CHECK-NEXT:mov r1, r3
+; CHECK-NEXT:pop {r7, pc}
+; CHECK-NEXT:  .LBB1_4:
+; CHECK-NEXT:movs r0, #0
+; CHECK-NEXT:movs r3, #0
+; CHECK-NEXT:mov r1, r3
+; CHECK-NEXT:pop {r7, pc}
+entry:
+  %cmp7 = icmp sgt i32 %n, 0
+  br i1 %cmp7, label %for.body.lr.ph, label %for.cond.cleanup
+
+for.body.lr.ph:   ; preds = %entry
+  %conv1 = sext i32 %y to i64
+  br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %entry
+  %s.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
+  ret i64 %s.0.lcssa
+
+for.body: ; preds = %for.body.lr.ph, 
%for.body
+  %i.09 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
+  %s.08 = phi i64 [ 0, %for.body.lr.ph ], [ %add, %for.body ]
+  %arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.09
+  %0 = load i32, i32* %arrayidx, align 4
+  %conv = sext i32 %0 to i64
+  %mul = mul nsw i64 %conv, %conv1
+  %add = add nsw i64 %mul, %s.08
+  %inc = add nuw nsw i32 %i.09, 1
+  %exitcond.not = icmp eq i32 %inc, %n
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}





[llvm-branch-commits] [llvm] 1454724 - [ARM] Align blocks that are not fall-through targets

2021-01-16 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-16T22:19:35Z
New Revision: 145472421535c71a9ea60af7e5d012ab69dc85ff

URL: 
https://github.com/llvm/llvm-project/commit/145472421535c71a9ea60af7e5d012ab69dc85ff
DIFF: 
https://github.com/llvm/llvm-project/commit/145472421535c71a9ea60af7e5d012ab69dc85ff.diff

LOG: [ARM] Align blocks that are not fall-through targets

If the previous block in a function does not fall through, the nops
added to align the next block will never be executed. This means we can
freely (except for codesize) align more branches. This happens in the
constant islands pass (as it cannot happen later) and only at
aggressive optimization levels, as it does increase codesize.

Differential Revision: https://reviews.llvm.org/D94394
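
As a minimal illustration, mirroring the multi-use-loads.ll updates
below: a predecessor block that ends in a return never falls through,
so the alignment padding in front of the next block can never execute
and only costs code size:

    pop {r4, pc}     @ block ends in a return: no fallthrough
    .p2align 2       @ alignment nops, never executed
  .LBB0_4:           @ branch target, now 4-byte aligned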

Added: 


Modified: 
llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
llvm/test/CodeGen/Thumb2/aligned-nonfallthrough.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp 
b/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
index e89eb0fb4502..630490f6f914 100644
--- a/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
+++ b/llvm/lib/Target/ARM/ARMConstantIslandPass.cpp
@@ -338,6 +338,32 @@ LLVM_DUMP_METHOD void ARMConstantIslands::dumpBBs() {
 }
 #endif
 
+// Align blocks where the previous block does not fall through. This may add
+// extra NOP's but they will not be executed. It uses the PrefLoopAlignment as 
a
+// measure of how much to align, and only runs at CodeGenOpt::Aggressive.
+static bool AlignBlocks(MachineFunction *MF) {
+  if (MF->getTarget().getOptLevel() != CodeGenOpt::Aggressive ||
+  MF->getFunction().hasOptSize())
+return false;
+
+  auto *TLI = MF->getSubtarget().getTargetLowering();
+  const Align Alignment = TLI->getPrefLoopAlignment();
+  if (Alignment < 4)
+return false;
+
+  bool Changed = false;
+  bool PrevCanFallthough = true;
+for (auto &MBB : *MF) {
+if (!PrevCanFallthough) {
+  Changed = true;
+  MBB.setAlignment(Alignment);
+}
+PrevCanFallthough = MBB.canFallThrough();
+  }
+
+  return Changed;
+}
+
bool ARMConstantIslands::runOnMachineFunction(MachineFunction &mf) {
  MF = &mf;
   MCP = mf.getConstantPool();
@@ -380,6 +406,9 @@ bool ARMConstantIslands::runOnMachineFunction(MachineFunction &mf) {
 MF->RenumberBlocks();
   }
 
+  // Align any non-fallthrough blocks
+  MadeChange |= AlignBlocks(MF);
+
   // Perform the initial placement of the constant pool entries.  To start 
with,
   // we put them all at the end of the function.
  std::vector<MachineInstr *> CPEMIs;

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
index b949934e51df..ea7a83b96b46 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/multi-use-loads.ll
@@ -26,6 +26,7 @@ define i32 @add_user(i32 %arg, i32* nocapture readnone %arg1, 
i16* nocapture rea
 ; CHECK-LE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-LE-NEXT:add.w r0, r12, r1
 ; CHECK-LE-NEXT:pop {r4, pc}
+; CHECK-LE-NEXT:.p2align 2
 ; CHECK-LE-NEXT:  .LBB0_4:
 ; CHECK-LE-NEXT:mov.w r12, #0
 ; CHECK-LE-NEXT:movs r1, #0
@@ -58,6 +59,7 @@ define i32 @add_user(i32 %arg, i32* nocapture readnone %arg1, 
i16* nocapture rea
 ; CHECK-BE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-BE-NEXT:add.w r0, r12, r1
 ; CHECK-BE-NEXT:pop {r4, r5, r6, pc}
+; CHECK-BE-NEXT:.p2align 2
 ; CHECK-BE-NEXT:  .LBB0_4:
 ; CHECK-BE-NEXT:mov.w r12, #0
 ; CHECK-BE-NEXT:movs r1, #0
@@ -129,6 +131,7 @@ define i32 @mul_bottom_user(i32 %arg, i32* nocapture 
readnone %arg1, i16* nocapt
 ; CHECK-LE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-LE-NEXT:add.w r0, r12, r1
 ; CHECK-LE-NEXT:pop {r4, r5, r7, pc}
+; CHECK-LE-NEXT:.p2align 2
 ; CHECK-LE-NEXT:  .LBB1_4:
 ; CHECK-LE-NEXT:mov.w r12, #0
 ; CHECK-LE-NEXT:movs r1, #0
@@ -161,6 +164,7 @@ define i32 @mul_bottom_user(i32 %arg, i32* nocapture 
readnone %arg1, i16* nocapt
 ; CHECK-BE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-BE-NEXT:add.w r0, r12, r1
 ; CHECK-BE-NEXT:pop {r4, r5, r6, pc}
+; CHECK-BE-NEXT:.p2align 2
 ; CHECK-BE-NEXT:  .LBB1_4:
 ; CHECK-BE-NEXT:mov.w r12, #0
 ; CHECK-BE-NEXT:movs r1, #0
@@ -232,6 +236,7 @@ define i32 @mul_top_user(i32 %arg, i32* nocapture readnone 
%arg1, i16* nocapture
 ; CHECK-LE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-LE-NEXT:add.w r0, r12, r1
 ; CHECK-LE-NEXT:pop {r4, pc}
+; CHECK-LE-NEXT:.p2align 2
 ; CHECK-LE-NEXT:  .LBB2_4:
 ; CHECK-LE-NEXT:mov.w r12, #0
 ; CHECK-LE-NEXT:movs r1, #0
@@ -264,6 +269,7 @@ define i32 @mul_top_user(i32 %arg, i32* nocapture readnone 
%arg1, i16* nocapture
 ; CHECK-BE-NEXT:  @ %bb.3: @ %for.cond.cleanup
 ; CHECK-BE-NEXT:add.w r0, r12, r1
 ; CHECK-BE-NEXT:pop {r4, r5, r6, pc}
+; CHECK-BE-NEXT:.p2align 2
 ; CHECK-BE-NEXT:  .LBB2_4:
 ; CHECK-BE-NEXT: 

[llvm-branch-commits] [llvm] 372eb2b - [ARM] Add low overhead loops terminators to AnalyzeBranch

2021-01-16 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-16T18:30:21Z
New Revision: 372eb2bbb6fb903ce76266e659dfefbaee67722b

URL: 
https://github.com/llvm/llvm-project/commit/372eb2bbb6fb903ce76266e659dfefbaee67722b
DIFF: 
https://github.com/llvm/llvm-project/commit/372eb2bbb6fb903ce76266e659dfefbaee67722b.diff

LOG: [ARM] Add low overhead loops terminators to AnalyzeBranch

This treats low overhead loop branches the same as jump tables and
indirect branches in analyzeBranch - they cannot be analyzed, but the
direct branches at the end of the block may still be removed. This
helps remove unnecessary branches earlier, which can produce better
codegen (and changes block layout in a number of cases).

Differential Revision: https://reviews.llvm.org/D94392
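
As a loose sketch of the block shape this targets (MIR-style, block
names illustrative rather than taken from the tests):

  bb.1:
    ...
    t2LoopEnd $lr, %bb.1      ; low overhead terminator: unanalyzable
    t2B %bb.2                 ; trailing direct branch

analyzeBranch still reports the block as unanalyzable, but the trailing
t2B can now be cleaned up when %bb.2 is the fallthrough successor.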

Added: 


Modified: 
llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/sibling-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vcmp-vpst-combination.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
llvm/test/CodeGen/Thumb2/mve-float16regloops.ll
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/Thumb2/mve-gather-increment.ll
llvm/test/CodeGen/Thumb2/mve-gather-scatter-optimisation.ll
llvm/test/CodeGen/Thumb2/mve-gather-tailpred.ll
llvm/test/CodeGen/Thumb2/mve-satmul-loops.ll
llvm/test/CodeGen/Thumb2/mve-scatter-increment.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll
llvm/test/CodeGen/Thumb2/mve-vldshuffle.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
index fa564f50f679..54586e0c256b 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
@@ -374,7 +374,8 @@ bool ARMBaseInstrInfo::analyzeBranch(MachineBasicBlock &MBB,
 
 if (isIndirectBranchOpcode(I->getOpcode()) ||
-isJumpTableBranchOpcode(I->getOpcode())) {
+isJumpTableBranchOpcode(I->getOpcode()) ||
+isLowOverheadTerminatorOpcode(I->getOpcode())) {
   // Indirect branches and jump tables can't be analyzed, but we still want
   // to clean up any instructions at the tail of the basic block.
   CantAnalyze = true;

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
index deb008025b1d..b14f7e480856 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
@@ -634,6 +634,11 @@ static inline bool isJumpTableBranchOpcode(int Opc) {
  Opc == ARM::t2BR_JT;
 }
 
+static inline bool isLowOverheadTerminatorOpcode(int Opc) {
+  return Opc == ARM::t2DoLoopStartTP || Opc == ARM::t2WhileLoopStart ||
+ Opc == ARM::t2LoopEnd || Opc == ARM::t2LoopEndDec;
+}
+
 static inline
 bool isIndirectBranchOpcode(int Opc) {
   return Opc == ARM::BX || Opc == ARM::MOVPCRX || Opc == ARM::tBRIND;

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
index fec6ff7c2154..ec574ad827a4 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
@@ -330,9 +330,9 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vdup.32 q1, r12
 ; CHECK-NEXT:vdup.32 q2, r12
 ; CHECK-NEXT:vstrw.32 q0, [sp] @ 16-byte Spill
-; CHECK-NEXT:b .LBB2_4
+; CHECK-NEXT:b .LBB2_5
 ; CHECK-NEXT:  .LBB2_2: @ %cond.load25
-; CHECK-NEXT:@ in Loop: Header=BB2_4 Depth=1
+; CHECK-NEXT:@ in Loop: Header=BB2_5 Depth=1
 ; CHECK-NEXT:vmovx.f16 s0, s28
 ; CHECK-NEXT:vmov r4, s28
 ; CHECK-NEXT:vmov r2, s0
@@ -344,7 +344,7 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vmov r2, s0
 ; CHECK-NEXT:vmov.16 q6[3], r2
 ; CHECK-NEXT:  .LBB2_3: @ %else26
-; CHECK-NEXT:@ in Loop: Header=BB2_4 Depth=1
+; CHECK-NEXT:@ in Loop: Header=BB2_5 Depth=1
 ; CHECK-NEXT:vmul.f16 q0, q6, q5
 ; CHECK-NEXT:adds r0, #8
 ; CHECK-NEXT:vcvtt.f32.f16 s23, s1
@@ -355,9 +355,18 @@ define arm_aapcs_vfpcc float @fast_float_half_mac(half* 
nocapture readonly %b, h
 ; CHECK-NEXT:vcvtb.f32.f16 s20, s0
 ; CHECK-NEXT:vadd.f32 q5, q3, q5
 ; CHECK-NEXT:subs.w lr, lr, #1
-; CHECK-NEXT:bne .LBB2_4
-; CHECK-NEXT:b .LBB2_21
-; CHECK-NEXT:  .LBB2_4: @ %vector.body
+; CHECK-NEXT:bne .LBB2_5
+; CHECK-NEXT:  @ %bb.4: @ %middle.block
+; CHECK-NEXT:vdup.32 q0, r12
+; CHECK-NEXT:vcmp.u32 cs, q0, q4
+; CHECK-NEXT:vpsel q0, q5, q3
+; CHECK-NEXT:vmov.f32 s4, s2
+; CHECK-NEXT:vmov.f32 s5, s3

[llvm-branch-commits] [llvm] c1ab698 - [ARM] Remove LLC tests from transform/hardware loop tests.

2021-01-16 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-16T18:30:21Z
New Revision: c1ab698dce8dd4e751e63142ebb333d5b90bb8dc

URL: 
https://github.com/llvm/llvm-project/commit/c1ab698dce8dd4e751e63142ebb333d5b90bb8dc
DIFF: 
https://github.com/llvm/llvm-project/commit/c1ab698dce8dd4e751e63142ebb333d5b90bb8dc.diff

LOG: [ARM] Remove LLC tests from transform/hardware loop tests.

We now have a lot of llc tests for hardware loops in CodeGen, which test
a larger variety of loops and are easier to maintain. This removes the
llc runs from the mixed llc/opt tests.

Added: 


Modified: 
llvm/test/Transforms/HardwareLoops/ARM/structure.ll

Removed: 




diff  --git a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll 
b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
index 480823fe7db8..f8ef14e2da4d 100644
--- a/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
+++ b/llvm/test/Transforms/HardwareLoops/ARM/structure.ll
@@ -1,7 +1,5 @@
 ; RUN: opt -mtriple=thumbv8.1m.main-none-none-eabi -hardware-loops %s -S -o - 
| \
 ; RUN: FileCheck %s
-; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi %s -o - | \
-; RUN: FileCheck %s --check-prefix=CHECK-LLC
 ; RUN: opt -mtriple=thumbv8.1m.main -loop-unroll -unroll-remainder=false -S < 
%s | \
 ; RUN: llc -mtriple=thumbv8.1m.main | FileCheck %s 
--check-prefix=CHECK-UNROLL
 ; RUN: opt -mtriple=thumbv8.1m.main-none-none-eabi -hardware-loops \
@@ -65,15 +63,6 @@ do.end:
 ; CHECK-NOT: [[LOOP_DEC1:%[^ ]+]] = call i1 @llvm.loop.decrement.i32(i32 1)
 ; CHECK-NOT: br i1 [[LOOP_DEC1]], label %while.cond1.preheader.us, label 
%while.end7
 
-; CHECK-LLC:  nested:
-; CHECK-LLC-NOT:mov lr, r1
-; CHECK-LLC:dls lr, r1
-; CHECK-LLC-NOT:mov lr, r1
-; CHECK-LLC:  [[LOOP_HEADER:\.LBB[0-9._]+]]:
-; CHECK-LLC:le lr, [[LOOP_HEADER]]
-; CHECK-LLC-NOT:b [[LOOP_EXIT:\.LBB[0-9._]+]]
-; CHECK-LLC:  [[LOOP_EXIT:\.LBB[0-9._]+]]:
-
 define void @nested(i32* nocapture %A, i32 %N) {
 entry:
   %cmp20 = icmp eq i32 %N, 0
@@ -363,12 +352,6 @@ for.body:
 ; CHECK: call i1 @llvm.test.set.loop.iterations.i32(i32 %N)
 ; CHECK: call i32 @llvm.loop.decrement.reg.i32(
 
-; CHECK-LLC-LABEL: unroll_inc_unsigned:
-; CHECK-LLC: wls lr, r3, [[EXIT:.LBB[0-9_]+]]
-; CHECK-LLC: [[HEADER:.LBB[0-9_]+]]:
-; CHECK-LLC: le lr, [[HEADER]]
-; CHECK-LLC-NEXT: [[EXIT]]:
-
 ; TODO: We should be able to support the unrolled loop body.
 ; CHECK-UNROLL-LABEL: unroll_inc_unsigned
 ; CHECK-UNROLL: [[PREHEADER:.LBB[0-9_]+]]: @ %for.body.preheader
@@ -407,14 +390,6 @@ for.body:
 ; CHECK: call i32 @llvm.start.loop.iterations.i32(i32 %N)
 ; CHECK: call i32 @llvm.loop.decrement.reg.i32(
 
-; TODO: An unnecessary register is being held to hold COUNT, lr should just
-; be used instead.
-; CHECK-LLC-LABEL: unroll_dec_int:
-; CHECK-LLC: dls lr, r3
-; CHECK-LLC-NOT: mov lr, r3
-; CHECK-LLC: [[HEADER:.LBB[0-9_]+]]:
-; CHECK-LLC: le lr, [[HEADER]]
-
 ; CHECK-UNROLL-LABEL: unroll_dec_int:
 ; CHECK-UNROLL: wls lr, {{.*}}, [[PROLOGUE_EXIT:.LBB[0-9_]+]]
 ; CHECK-UNROLL-NEXT: [[PROLOGUE:.LBB[0-9_]+]]:





[llvm-branch-commits] [llvm] f5abf0b - [ARM] Tail predication with constant loop bounds

2021-01-15 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-15T18:17:31Z
New Revision: f5abf0bd485a1fa7e332f5f8266c25755d385a8a

URL: 
https://github.com/llvm/llvm-project/commit/f5abf0bd485a1fa7e332f5f8266c25755d385a8a
DIFF: 
https://github.com/llvm/llvm-project/commit/f5abf0bd485a1fa7e332f5f8266c25755d385a8a.diff

LOG: [ARM] Tail predication with constant loop bounds

The trip count for a predicated vector loop body will be
ceil(ElementCount/Width). This alters the conversion of an
active.lane.mask to a VCTP intrinsic to match.

Differential Revision: https://reviews.llvm.org/D94608
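
As a worked example, for the test_501_504 case in the constbound.ll
test updated below (ElementCount = 501, Width = 4):

  TC1 = 126                        (from set.loop.iterations; the old
                                    code's mov.w lr, #126)
  TC2 = (501 + 4 - 1) / 4 = 126    (ceil(501 / 4))

The two values now agree, so the get.active.lane.mask is converted to a
VCTP and the loop can become a dlstp/letp tail-predicated loop.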

Added: 


Modified: 
llvm/lib/Target/ARM/MVETailPredication.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tp-multiple-vpst.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/MVETailPredication.cpp 
b/llvm/lib/Target/ARM/MVETailPredication.cpp
index 8055b5cf500d..b705208660df 100644
--- a/llvm/lib/Target/ARM/MVETailPredication.cpp
+++ b/llvm/lib/Target/ARM/MVETailPredication.cpp
@@ -230,18 +230,16 @@ bool MVETailPredication::IsSafeActiveMask(IntrinsicInst 
*ActiveLaneMask,
 }
 
 // Calculate 2 tripcount values and check that they are consistent with
-// each other:
-// i) The number of loop iterations extracted from the set.loop.iterations
-//intrinsic, multipled by the vector width:
-uint64_t TC1 = TC->getZExtValue() * VectorWidth;
-
-// ii) TC1 has to be equal to TC + 1, with the + 1 to compensate for start
-// counting from 0.
-uint64_t TC2 = ConstElemCount->getZExtValue() + 1;
-
-// If the tripcount values are inconsistent, we don't want to insert the
-// VCTP and trigger tail-predication; it's better to keep intrinsic
-// get.active.lane.mask and legalize this.
+// each other. The TripCount for a predicated vector loop body is
+// ceil(ElementCount/Width), or floor((ElementCount+Width-1)/Width) as we
+// work it out here.
+uint64_t TC1 = TC->getZExtValue();
+uint64_t TC2 =
+(ConstElemCount->getZExtValue() + VectorWidth - 1) / VectorWidth;
+
+// If the tripcount values are inconsistent, we can't insert the VCTP and
+// trigger tail-predication; keep the intrinsic as a get.active.lane.mask
+// and legalize this.
 if (TC1 != TC2) {
   LLVM_DEBUG(dbgs() << "ARM TP: inconsistent constant tripcount values: "
  << TC1 << " from set.loop.iterations, and "

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
index 480680bee89d..d1f5a07bc4a9 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
@@ -62,41 +62,17 @@ define dso_local i32 @test_501_504(i32* nocapture readonly 
%x) {
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:.save {r7, lr}
 ; CHECK-NEXT:push {r7, lr}
-; CHECK-NEXT:adr r2, .LCPI1_0
-; CHECK-NEXT:mov.w lr, #126
-; CHECK-NEXT:vldrw.u32 q0, [r2]
-; CHECK-NEXT:adr r2, .LCPI1_1
-; CHECK-NEXT:vldrw.u32 q1, [r2]
-; CHECK-NEXT:dls lr, lr
-; CHECK-NEXT:movs r1, #0
+; CHECK-NEXT:movw r1, #501
 ; CHECK-NEXT:movs r2, #0
+; CHECK-NEXT:dlstp.32 lr, r1
 ; CHECK-NEXT:  .LBB1_1: @ %vector.body
 ; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:vadd.i32 q2, q0, r1
-; CHECK-NEXT:vdup.32 q3, r1
-; CHECK-NEXT:vcmp.u32 hi, q3, q2
-; CHECK-NEXT:adds r1, #4
-; CHECK-NEXT:vpnot
-; CHECK-NEXT:vpsttt
-; CHECK-NEXT:vcmpt.u32 hi, q1, q2
-; CHECK-NEXT:vldrwt.u32 q2, [r0], #16
-; CHECK-NEXT:vaddvat.u32 r2, q2
-; CHECK-NEXT:le lr, .LBB1_1
+; CHECK-NEXT:vldrw.u32 q0, [r0], #16
+; CHECK-NEXT:vaddva.u32 r2, q0
+; CHECK-NEXT:letp lr, .LBB1_1
 ; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
 ; CHECK-NEXT:mov r0, r2
 ; CHECK-NEXT:pop {r7, pc}
-; CHECK-NEXT:.p2align 4
-; CHECK-NEXT:  @ %bb.3:
-; CHECK-NEXT:  .LCPI1_0:
-; CHECK-NEXT:.long 0 @ 0x0
-; CHECK-NEXT:.long 1 @ 0x1
-; CHECK-NEXT:.long 2 @ 0x2
-; CHECK-NEXT:.long 3 @ 0x3
-; CHECK-NEXT:  .LCPI1_1:
-; CHECK-NEXT:.long 501 @ 0x1f5
-; CHECK-NEXT:.long 501 @ 0x1f5
-; CHECK-NEXT:.long 501 @ 0x1f5
-; CHECK-NEXT:.long 501 @ 0x1f5
 entry:
   br label %vector.body
 
@@ -123,41 +99,17 @@ define dso_local i32 @test_502_504(i32* nocapture readonly 
%x) {
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:.save {r7, lr}
 ; CHECK-NEXT:push {r7, lr}
-; CHECK-NEXT:adr r2, .LCPI2_0
-; CHECK-NEXT:mov.w lr, #126
-; CHECK-NEXT:vldrw.u32 q0, [r2]
-; CHECK-NEXT:adr r2, .LCPI2_1
-; CHECK-NEXT:vldrw.u32 q1, [r2]
-; CHECK-NEXT:dls lr, lr
-; CHECK-NEXT:movs r1, #0
+; CHECK-NEXT:mov.w r1, #502
 ; CHECK-NEXT:movs r2, #0
+; CHECK-NEXT:dlstp.32 lr, r1
 ; CHECK-NEXT:  .LBB2_1: @ %vector.body
 ; CHECK-NEXT:@ =>This Inner Loop 

[llvm-branch-commits] [llvm] a0770f9 - [ARM] Constant tripcount tail predication loop tests. NFC

2021-01-15 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-15T18:02:07Z
New Revision: a0770f9e4e923292066dd095cf01a28671e40ad6

URL: 
https://github.com/llvm/llvm-project/commit/a0770f9e4e923292066dd095cf01a28671e40ad6
DIFF: 
https://github.com/llvm/llvm-project/commit/a0770f9e4e923292066dd095cf01a28671e40ad6.diff

LOG: [ARM] Constant tripcount tail predication loop tests. NFC

Added: 
llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll

Modified: 


Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
new file mode 100644
index ..480680bee89d
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/constbound.ll
@@ -0,0 +1,277 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve %s -o - | 
FileCheck %s
+
+define dso_local i32 @test_500_504(i32* nocapture readonly %x) {
+; CHECK-LABEL: test_500_504:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r7, lr}
+; CHECK-NEXT:push {r7, lr}
+; CHECK-NEXT:mov.w lr, #126
+; CHECK-NEXT:adr r2, .LCPI0_0
+; CHECK-NEXT:vldrw.u32 q0, [r2]
+; CHECK-NEXT:mov.w r2, #500
+; CHECK-NEXT:dls lr, lr
+; CHECK-NEXT:vdup.32 q1, r2
+; CHECK-NEXT:movs r1, #0
+; CHECK-NEXT:movs r2, #0
+; CHECK-NEXT:  .LBB0_1: @ %vector.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:vadd.i32 q2, q0, r1
+; CHECK-NEXT:vdup.32 q3, r1
+; CHECK-NEXT:vcmp.u32 hi, q3, q2
+; CHECK-NEXT:adds r1, #4
+; CHECK-NEXT:vpnot
+; CHECK-NEXT:vpsttt
+; CHECK-NEXT:vcmpt.u32 hi, q1, q2
+; CHECK-NEXT:vldrwt.u32 q2, [r0], #16
+; CHECK-NEXT:vaddvat.u32 r2, q2
+; CHECK-NEXT:le lr, .LBB0_1
+; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
+; CHECK-NEXT:mov r0, r2
+; CHECK-NEXT:pop {r7, pc}
+; CHECK-NEXT:.p2align 4
+; CHECK-NEXT:  @ %bb.3:
+; CHECK-NEXT:  .LCPI0_0:
+; CHECK-NEXT:.long 0 @ 0x0
+; CHECK-NEXT:.long 1 @ 0x1
+; CHECK-NEXT:.long 2 @ 0x2
+; CHECK-NEXT:.long 3 @ 0x3
+entry:
+  br label %vector.body
+
+vector.body:  ; preds = %vector.body, 
%entry
+  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
+  %vec.phi = phi i32 [ 0, %entry ], [ %4, %vector.body ]
+  %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 
%index, i32 500)
+  %0 = getelementptr inbounds i32, i32* %x, i32 %index
+  %1 = bitcast i32* %0 to <4 x i32>*
+  %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x 
i32>* %1, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
+  %2 = select <4 x i1> %active.lane.mask, <4 x i32> %wide.masked.load, <4 x 
i32> zeroinitializer
+  %3 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %2)
+  %4 = add i32 %3, %vec.phi
+  %index.next = add i32 %index, 4
+  %5 = icmp eq i32 %index.next, 504
+  br i1 %5, label %for.cond.cleanup, label %vector.body
+
+for.cond.cleanup: ; preds = %vector.body
+  ret i32 %4
+}
+
+define dso_local i32 @test_501_504(i32* nocapture readonly %x) {
+; CHECK-LABEL: test_501_504:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r7, lr}
+; CHECK-NEXT:push {r7, lr}
+; CHECK-NEXT:adr r2, .LCPI1_0
+; CHECK-NEXT:mov.w lr, #126
+; CHECK-NEXT:vldrw.u32 q0, [r2]
+; CHECK-NEXT:adr r2, .LCPI1_1
+; CHECK-NEXT:vldrw.u32 q1, [r2]
+; CHECK-NEXT:dls lr, lr
+; CHECK-NEXT:movs r1, #0
+; CHECK-NEXT:movs r2, #0
+; CHECK-NEXT:  .LBB1_1: @ %vector.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:vadd.i32 q2, q0, r1
+; CHECK-NEXT:vdup.32 q3, r1
+; CHECK-NEXT:vcmp.u32 hi, q3, q2
+; CHECK-NEXT:adds r1, #4
+; CHECK-NEXT:vpnot
+; CHECK-NEXT:vpsttt
+; CHECK-NEXT:vcmpt.u32 hi, q1, q2
+; CHECK-NEXT:vldrwt.u32 q2, [r0], #16
+; CHECK-NEXT:vaddvat.u32 r2, q2
+; CHECK-NEXT:le lr, .LBB1_1
+; CHECK-NEXT:  @ %bb.2: @ %for.cond.cleanup
+; CHECK-NEXT:mov r0, r2
+; CHECK-NEXT:pop {r7, pc}
+; CHECK-NEXT:.p2align 4
+; CHECK-NEXT:  @ %bb.3:
+; CHECK-NEXT:  .LCPI1_0:
+; CHECK-NEXT:.long 0 @ 0x0
+; CHECK-NEXT:.long 1 @ 0x1
+; CHECK-NEXT:.long 2 @ 0x2
+; CHECK-NEXT:.long 3 @ 0x3
+; CHECK-NEXT:  .LCPI1_1:
+; CHECK-NEXT:.long 501 @ 0x1f5
+; CHECK-NEXT:.long 501 @ 0x1f5
+; CHECK-NEXT:.long 501 @ 0x1f5
+; CHECK-NEXT:.long 501 @ 0x1f5
+entry:
+  br label %vector.body
+
+vector.body:  ; preds = %vector.body, 
%entry
+  %index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
+  %vec.phi = phi i32 [ 0, %entry ], [ %4, %vector.body ]
+  %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 
%index, i32 501)
+  %0 = getelementptr inbounds i32, i32* %x, i32 %index
+  %1 = bitcast i32* %0 to <4 x i32>*
+  %wide.masked.load 

[llvm-branch-commits] [llvm] c29ca85 - [ARM] Update isVMOVNOriginalMask to handle single input shuffle vectors

2021-01-13 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-13T08:51:28Z
New Revision: c29ca8551afff316976c2befcd65eeef53798499

URL: 
https://github.com/llvm/llvm-project/commit/c29ca8551afff316976c2befcd65eeef53798499
DIFF: 
https://github.com/llvm/llvm-project/commit/c29ca8551afff316976c2befcd65eeef53798499.diff

LOG: [ARM] Update isVMOVNOriginalMask to handle single input shuffle vectors

isVMOVNOriginalMask previously only checked for two-input shuffles that
could be better expanded as vmovn nodes. This extends the check to
single-input shuffles that will later be legalized to multiple vectors.

Differential Revision: https://reviews.llvm.org/D94189
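
As an illustrative example of the newly handled single-input case,
using the pattern from the comment below with NumElts = 8 and therefore
N = 4 (second shuffle operand undef):

  !rev mask: <0, 4, 1, 5, 2, 6, 3, 7>
   rev mask: <4, 0, 5, 1, 6, 2, 7, 3>

These are exactly the one-source shuffle masks in the mve-vmovnstore.ll
tests updated below, which now select a single vmovnt.i32.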

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 982397dbb2db..46c5efa2cf2f 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -14695,18 +14695,22 @@ static SDValue 
PerformSplittingToNarrowingStores(StoreSDNode *St,
   // use the VMOVN over splitting the store. We are looking for patterns of:
   // !rev: 0 N 1 N+1 2 N+2 ...
   //  rev: N 0 N+1 1 N+2 2 ...
-  auto isVMOVNOriginalMask = [&](ArrayRef<int> M, bool rev) {
+  // The shuffle may either be a single source (in which case N = NumElts/2) or
+  // two inputs extended with concat to the same size (in which case N =
+  // NumElts).
+  auto isVMOVNShuffle = [&](ShuffleVectorSDNode *SVN, bool Rev) {
+ArrayRef<int> M = SVN->getMask();
 unsigned NumElts = ToVT.getVectorNumElements();
-if (NumElts != M.size())
-  return false;
+if (SVN->getOperand(1).isUndef())
+  NumElts /= 2;
 
-unsigned Off0 = rev ? NumElts : 0;
-unsigned Off1 = rev ? 0 : NumElts;
+unsigned Off0 = Rev ? NumElts : 0;
+unsigned Off1 = Rev ? 0 : NumElts;
 
-for (unsigned i = 0; i < NumElts; i += 2) {
-  if (M[i] >= 0 && M[i] != (int)(Off0 + i / 2))
+for (unsigned I = 0; I < NumElts; I += 2) {
+  if (M[I] >= 0 && M[I] != (int)(Off0 + I / 2))
 return false;
-  if (M[i + 1] >= 0 && M[i + 1] != (int)(Off1 + i / 2))
+  if (M[I + 1] >= 0 && M[I + 1] != (int)(Off1 + I / 2))
 return false;
 }
 
@@ -14721,9 +14725,8 @@ static SDValue 
PerformSplittingToNarrowingStores(StoreSDNode *St,
   return SDValue();
 }
   }
-  if (auto *Shuffle = dyn_cast<ShuffleVectorSDNode>(Trunc->getOperand(0)))
-if (isVMOVNOriginalMask(Shuffle->getMask(), false) ||
-isVMOVNOriginalMask(Shuffle->getMask(), true))
+  if (auto *Shuffle = dyn_cast<ShuffleVectorSDNode>(Trunc.getOperand(0)))
+if (isVMOVNShuffle(Shuffle, false) || isVMOVNShuffle(Shuffle, true))
   return SDValue();
 
   LLVMContext &C = *DAG.getContext();

diff  --git a/llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll 
b/llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll
index f9a535e9d2dc..aba29b4e5e48 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll
@@ -30,16 +30,8 @@ entry:
 define arm_aapcs_vfpcc void @vmovn32_trunc1_onesrc(<8 x i32> %src1, <8 x i16> 
*%dest) {
 ; CHECK-LABEL: vmovn32_trunc1_onesrc:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:vmov.f32 s8, s2
-; CHECK-NEXT:vmov.f32 s9, s6
-; CHECK-NEXT:vmov.f32 s10, s3
-; CHECK-NEXT:vmov.f32 s11, s7
-; CHECK-NEXT:vstrh.32 q2, [r0, #8]
-; CHECK-NEXT:vmov.f32 s8, s0
-; CHECK-NEXT:vmov.f32 s9, s4
-; CHECK-NEXT:vmov.f32 s10, s1
-; CHECK-NEXT:vmov.f32 s11, s5
-; CHECK-NEXT:vstrh.32 q2, [r0]
+; CHECK-NEXT:vmovnt.i32 q0, q1
+; CHECK-NEXT:vstrw.32 q0, [r0]
 ; CHECK-NEXT:bx lr
 entry:
   %strided.vec = shufflevector <8 x i32> %src1, <8 x i32> undef, <8 x i32> 

@@ -51,16 +43,8 @@ entry:
 define arm_aapcs_vfpcc void @vmovn32_trunc2_onesrc(<8 x i32> %src1, <8 x i16> 
*%dest) {
 ; CHECK-LABEL: vmovn32_trunc2_onesrc:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:vmov.f32 s8, s6
-; CHECK-NEXT:vmov.f32 s9, s2
-; CHECK-NEXT:vmov.f32 s10, s7
-; CHECK-NEXT:vmov.f32 s11, s3
-; CHECK-NEXT:vstrh.32 q2, [r0, #8]
-; CHECK-NEXT:vmov.f32 s8, s4
-; CHECK-NEXT:vmov.f32 s9, s0
-; CHECK-NEXT:vmov.f32 s10, s5
-; CHECK-NEXT:vmov.f32 s11, s1
-; CHECK-NEXT:vstrh.32 q2, [r0]
+; CHECK-NEXT:vmovnt.i32 q1, q0
+; CHECK-NEXT:vstrw.32 q1, [r0]
 ; CHECK-NEXT:bx lr
 entry:
  %strided.vec = shufflevector <8 x i32> %src1, <8 x i32> undef, <8 x i32> <i32 4, i32 0, i32 5, i32 1, i32 6, i32 2, i32 7, i32 3>
@@ -98,40 +82,8 @@ entry:
 define arm_aapcs_vfpcc void @vmovn16_trunc1_onesrc(<16 x i16> %src1, <16 x i8> 
*%dest) {
 ; CHECK-LABEL: vmovn16_trunc1_onesrc:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:vmov.u16 r1, q0[4]
-; CHECK-NEXT:vmov.16 q2[0], r1
-; CHECK-NEXT:vmov.u16 r1, q1[4]
-; CHECK-NEXT:vmov.16 q2[1], r1
-; CHECK-NEXT:vmov.u16 r1, q0[5]
-; CHECK-NEXT:vmov.16 q2[2], r1
-; CHECK-NEXT:vmov.u16 r1, q1[5]
-; 

[llvm-branch-commits] [llvm] 3aeb30d - [ARM] Additional tests for different interleaving patterns. NFC

2021-01-13 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-13T08:31:50Z
New Revision: 3aeb30d1a68a76616c699587e07a7d8880c29d1c

URL: 
https://github.com/llvm/llvm-project/commit/3aeb30d1a68a76616c699587e07a7d8880c29d1c
DIFF: 
https://github.com/llvm/llvm-project/commit/3aeb30d1a68a76616c699587e07a7d8880c29d1c.diff

LOG: [ARM] Additional tests for different interleaving patterns. NFC

Added: 


Modified: 
llvm/test/CodeGen/Thumb2/mve-shuffleext.ll
llvm/test/CodeGen/Thumb2/mve-vcvt.ll
llvm/test/CodeGen/Thumb2/mve-vmovnstore.ll
llvm/test/CodeGen/Thumb2/mve-vqdmulh.ll

Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-shuffleext.ll 
b/llvm/test/CodeGen/Thumb2/mve-shuffleext.ll
index c7165e71b5dd..715be1d921ca 100644
--- a/llvm/test/CodeGen/Thumb2/mve-shuffleext.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-shuffleext.ll
@@ -14,6 +14,33 @@ entry:
   ret <4 x i32> %out
 }
 
+define arm_aapcs_vfpcc <4 x i32> @sext_i32_0246_swapped(<8 x i16> %src) {
+; CHECK-LABEL: sext_i32_0246_swapped:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.u16 r0, q0[2]
+; CHECK-NEXT:vmov.u16 r1, q0[0]
+; CHECK-NEXT:vmov q1[2], q1[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[3]
+; CHECK-NEXT:vmov.u16 r1, q0[1]
+; CHECK-NEXT:vmov q1[3], q1[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[6]
+; CHECK-NEXT:vmov.u16 r1, q0[4]
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[7]
+; CHECK-NEXT:vmov.u16 r1, q0[5]
+; CHECK-NEXT:vmovlb.s16 q0, q1
+; CHECK-NEXT:vmov q2[3], q2[1], r1, r0
+; CHECK-NEXT:vmov.f32 s1, s2
+; CHECK-NEXT:vmovlb.s16 q2, q2
+; CHECK-NEXT:vmov.f32 s2, s8
+; CHECK-NEXT:vmov.f32 s3, s10
+; CHECK-NEXT:bx lr
+entry:
+  %out = sext <8 x i16> %src to <8 x i32>
+  %strided.vec = shufflevector <8 x i32> %out, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+  ret <4 x i32> %strided.vec
+}
+
 define arm_aapcs_vfpcc <4 x i32> @sext_i32_1357(<8 x i16> %src) {
 ; CHECK-LABEL: sext_i32_1357:
 ; CHECK:   @ %bb.0: @ %entry
@@ -25,6 +52,34 @@ entry:
   ret <4 x i32> %out
 }
 
+define arm_aapcs_vfpcc <4 x i32> @sext_i32_1357_swapped(<8 x i16> %src) {
+; CHECK-LABEL: sext_i32_1357_swapped:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.u16 r0, q0[2]
+; CHECK-NEXT:vmov.u16 r1, q0[0]
+; CHECK-NEXT:vmov q1[2], q1[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[3]
+; CHECK-NEXT:vmov.u16 r1, q0[1]
+; CHECK-NEXT:vmov q1[3], q1[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[6]
+; CHECK-NEXT:vmov.u16 r1, q0[4]
+; CHECK-NEXT:vmovlb.s16 q1, q1
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[7]
+; CHECK-NEXT:vmov.u16 r1, q0[5]
+; CHECK-NEXT:vmov.f32 s0, s5
+; CHECK-NEXT:vmov q2[3], q2[1], r1, r0
+; CHECK-NEXT:vmov.f32 s1, s7
+; CHECK-NEXT:vmovlb.s16 q2, q2
+; CHECK-NEXT:vmov.f32 s2, s9
+; CHECK-NEXT:vmov.f32 s3, s11
+; CHECK-NEXT:bx lr
+entry:
+  %out = sext <8 x i16> %src to <8 x i32>
+  %strided.vec = shufflevector <8 x i32> %out, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+  ret <4 x i32> %strided.vec
+}
+
 define arm_aapcs_vfpcc <8 x i32> @sext_i32_02468101214(<16 x i16> %src) {
 ; CHECK-LABEL: sext_i32_02468101214:
 ; CHECK:   @ %bb.0: @ %entry
@@ -37,6 +92,50 @@ entry:
   ret <8 x i32> %out
 }
 
+define arm_aapcs_vfpcc <8 x i32> @sext_i32_02468101214_swapped(<16 x i16> 
%src) {
+; CHECK-LABEL: sext_i32_02468101214_swapped:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.u16 r0, q0[2]
+; CHECK-NEXT:vmov.u16 r1, q0[0]
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[3]
+; CHECK-NEXT:vmov.u16 r1, q0[1]
+; CHECK-NEXT:vmov q2[3], q2[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[6]
+; CHECK-NEXT:vmov.u16 r1, q0[4]
+; CHECK-NEXT:vmov q3[2], q3[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[7]
+; CHECK-NEXT:vmov.u16 r1, q0[5]
+; CHECK-NEXT:vmovlb.s16 q0, q2
+; CHECK-NEXT:vmov q3[3], q3[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q1[2]
+; CHECK-NEXT:vmov.u16 r1, q1[0]
+; CHECK-NEXT:vmovlb.s16 q3, q3
+; CHECK-NEXT:vmov.f32 s1, s2
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q1[3]
+; CHECK-NEXT:vmov.u16 r1, q1[1]
+; CHECK-NEXT:vmov.f32 s2, s12
+; CHECK-NEXT:vmov q2[3], q2[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q1[6]
+; CHECK-NEXT:vmov.u16 r1, q1[4]
+; CHECK-NEXT:vmov.f32 s3, s14
+; CHECK-NEXT:vmov q3[2], q3[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q1[7]
+; CHECK-NEXT:vmov.u16 r1, q1[5]
+; CHECK-NEXT:vmovlb.s16 q1, q2
+; CHECK-NEXT:vmov q3[3], q3[1], r1, r0
+; CHECK-NEXT:vmovlb.s16 q3, q3
+; CHECK-NEXT:vmov.f32 s5, s6
+; CHECK-NEXT:vmov.f32 s6, s12
+; CHECK-NEXT:vmov.f32 s7, s14
+; CHECK-NEXT:bx lr
+entry:
+  %out = sext <16 x i16> %src to <16 x i32>
+  %strided.vec = shufflevector <16 x i32> %out, <16 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
+  ret <8 x i32> %strided.vec
+}
+
 

[llvm-branch-commits] [llvm] 8165a03 - [ARM] Add debug messages for the load store optimizer. NFC

2021-01-11 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-11T09:24:28Z
New Revision: 8165a0342033e58ce6090fbc425ebdc7c455469f

URL: 
https://github.com/llvm/llvm-project/commit/8165a0342033e58ce6090fbc425ebdc7c455469f
DIFF: 
https://github.com/llvm/llvm-project/commit/8165a0342033e58ce6090fbc425ebdc7c455469f.diff

LOG: [ARM] Add debug messages for the load store optimizer. NFC

Added: 


Modified: 
llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp 
b/llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
index a5da50608087..5144cf953e99 100644
--- a/llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
@@ -1268,6 +1268,7 @@ findIncDecAfter(MachineBasicBlock::iterator MBBI, 
Register Reg,
 bool ARMLoadStoreOpt::MergeBaseUpdateLSMultiple(MachineInstr *MI) {
   // Thumb1 is already using updating loads/stores.
   if (isThumb1) return false;
+  LLVM_DEBUG(dbgs() << "Attempting to merge update of: " << *MI);
 
  const MachineOperand &BaseOP = MI->getOperand(0);
   Register Base = BaseOP.getReg();
@@ -1319,8 +1320,10 @@ bool 
ARMLoadStoreOpt::MergeBaseUpdateLSMultiple(MachineInstr *MI) {
 return false;
 }
   }
-  if (MergeInstr != MBB.end())
+  if (MergeInstr != MBB.end()) {
+LLVM_DEBUG(dbgs() << "  Erasing old increment: " << *MergeInstr);
 MBB.erase(MergeInstr);
+  }
 
   unsigned NewOpc = getUpdatingLSMultipleOpcode(Opcode, Mode);
   MachineInstrBuilder MIB = BuildMI(MBB, MBBI, DL, TII->get(NewOpc))
@@ -1335,6 +1338,7 @@ bool 
ARMLoadStoreOpt::MergeBaseUpdateLSMultiple(MachineInstr *MI) {
   // Transfer memoperands.
   MIB.setMemRefs(MI->memoperands());
 
+  LLVM_DEBUG(dbgs() << "  Added new load/store: " << *MIB);
   MBB.erase(MBBI);
   return true;
 }
@@ -1445,6 +1449,7 @@ bool 
ARMLoadStoreOpt::MergeBaseUpdateLoadStore(MachineInstr *MI) {
   // Thumb1 doesn't have updating LDR/STR.
   // FIXME: Use LDM/STM with single register instead.
   if (isThumb1) return false;
+  LLVM_DEBUG(dbgs() << "Attempting to merge update of: " << *MI);
 
   Register Base = getLoadStoreBaseOp(*MI).getReg();
   bool BaseKill = getLoadStoreBaseOp(*MI).isKill();
@@ -1486,6 +1491,7 @@ bool 
ARMLoadStoreOpt::MergeBaseUpdateLoadStore(MachineInstr *MI) {
 } else
   return false;
   }
+  LLVM_DEBUG(dbgs() << "  Erasing old increment: " << *MergeInstr);
   MBB.erase(MergeInstr);
 
   ARM_AM::AddrOpc AddSub = Offset < 0 ? ARM_AM::sub : ARM_AM::add;
@@ -1497,39 +1503,50 @@ bool 
ARMLoadStoreOpt::MergeBaseUpdateLoadStore(MachineInstr *MI) {
 // updating load/store-multiple instructions can be used with only one
 // register.)
 MachineOperand &MO = MI->getOperand(0);
-BuildMI(MBB, MBBI, DL, TII->get(NewOpc))
-  .addReg(Base, getDefRegState(true)) // WB base register
-  .addReg(Base, getKillRegState(isLd ? BaseKill : false))
-  .addImm(Pred).addReg(PredReg)
-  .addReg(MO.getReg(), (isLd ? getDefRegState(true) :
-getKillRegState(MO.isKill(
-  .cloneMemRefs(*MI);
+auto MIB = BuildMI(MBB, MBBI, DL, TII->get(NewOpc))
+   .addReg(Base, getDefRegState(true)) // WB base register
+   .addReg(Base, getKillRegState(isLd ? BaseKill : false))
+   .addImm(Pred)
+   .addReg(PredReg)
+   .addReg(MO.getReg(), (isLd ? getDefRegState(true)
+  : getKillRegState(MO.isKill(
+   .cloneMemRefs(*MI);
+LLVM_DEBUG(dbgs() << "  Added new instruction: " << *MIB);
   } else if (isLd) {
 if (isAM2) {
   // LDR_PRE, LDR_POST
   if (NewOpc == ARM::LDR_PRE_IMM || NewOpc == ARM::LDRB_PRE_IMM) {
-BuildMI(MBB, MBBI, DL, TII->get(NewOpc), MI->getOperand(0).getReg())
-  .addReg(Base, RegState::Define)
-  .addReg(Base).addImm(Offset).addImm(Pred).addReg(PredReg)
-  .cloneMemRefs(*MI);
+auto MIB =
+BuildMI(MBB, MBBI, DL, TII->get(NewOpc), 
MI->getOperand(0).getReg())
+.addReg(Base, RegState::Define)
+.addReg(Base)
+.addImm(Offset)
+.addImm(Pred)
+.addReg(PredReg)
+.cloneMemRefs(*MI);
+LLVM_DEBUG(dbgs() << "  Added new instruction: " << *MIB);
   } else {
 int Imm = ARM_AM::getAM2Opc(AddSub, Bytes, ARM_AM::no_shift);
-BuildMI(MBB, MBBI, DL, TII->get(NewOpc), MI->getOperand(0).getReg())
-.addReg(Base, RegState::Define)
-.addReg(Base)
-.addReg(0)
-.addImm(Imm)
-.add(predOps(Pred, PredReg))
-.cloneMemRefs(*MI);
+auto MIB =
+BuildMI(MBB, MBBI, DL, TII->get(NewOpc), 
MI->getOperand(0).getReg())
+.addReg(Base, RegState::Define)
+.addReg(Base)
+.addReg(0)
+

[llvm-branch-commits] [llvm] dcefcd5 - [ARM] Update trunc costs

2021-01-11 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-11T08:59:28Z
New Revision: dcefcd51e01741c79c9d9a729fe803b13287a372

URL: 
https://github.com/llvm/llvm-project/commit/dcefcd51e01741c79c9d9a729fe803b13287a372
DIFF: 
https://github.com/llvm/llvm-project/commit/dcefcd51e01741c79c9d9a729fe803b13287a372.diff

LOG: [ARM] Update trunc costs

We did not have specific costs for larger-than-legal truncates that
were not otherwise cheap (those next to stores, for example). As MVE
does not have a dedicated instruction for them (and we do not yet use
loads/stores for them), they should be expensive, as they get expanded
into a series of lane moves.

Differential Revision: https://reviews.llvm.org/D94260
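
As a worked example of the new generic cost: truncating <16 x i32> to
<16 x i16> has a 512-bit fixed-length source, which is both larger than
the 128-bit legal vector size and larger than the destination, so it is
costed at

  16 elements * 2 = 32

roughly two lane-move instructions per element (a sketch of the formula
added below; truncates matched by the cheap table entries are costed
separately).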

Added: 


Modified: 
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
llvm/test/Analysis/CostModel/ARM/cast.ll
llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp 
b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
index 0dc0afe271d1..a75c771e66be 100644
--- a/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
@@ -491,6 +491,7 @@ int ARMTTIImpl::getCastInstrCost(unsigned Opcode, Type 
*Dst, Type *Src,
 {ISD::TRUNCATE, MVT::v4i32, MVT::v4i8, 0},
 {ISD::TRUNCATE, MVT::v8i16, MVT::v8i8, 0},
 {ISD::TRUNCATE, MVT::v8i32, MVT::v8i16, 1},
+{ISD::TRUNCATE, MVT::v8i32, MVT::v8i8, 1},
 {ISD::TRUNCATE, MVT::v16i32, MVT::v16i8, 3},
 {ISD::TRUNCATE, MVT::v16i16, MVT::v16i8, 1},
 };
@@ -751,6 +752,18 @@ int ARMTTIImpl::getCastInstrCost(unsigned Opcode, Type 
*Dst, Type *Src,
   return Lanes * CallCost;
   }
 
+  if (ISD == ISD::TRUNCATE && ST->hasMVEIntegerOps() &&
+  SrcTy.isFixedLengthVector()) {
+// Treat a truncate with larger than legal source (128bits for MVE) as
+// expensive, 2 instructions per lane.
+if ((SrcTy.getScalarType() == MVT::i8 ||
+ SrcTy.getScalarType() == MVT::i16 ||
+ SrcTy.getScalarType() == MVT::i32) &&
+SrcTy.getSizeInBits() > 128 &&
+SrcTy.getSizeInBits() > DstTy.getSizeInBits())
+  return SrcTy.getVectorNumElements() * 2;
+  }
+
   // Scalar integer conversion costs.
   static const TypeConversionCostTblEntry ARMIntegerConversionTbl[] = {
 // i16 -> i64 requires two dependent operations.

diff  --git a/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll 
b/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
index 25b268b9b244..172df8600356 100644
--- a/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
+++ b/llvm/test/Analysis/CostModel/ARM/arith-overflow.ll
@@ -707,13 +707,13 @@ define i32 @smul(i32 %arg) {
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 316 for instruction: 
%V8I32 = call { <8 x i32>, <8 x i1> } @llvm.smul.with.overflow.v8i32(<8 x i32> 
undef, <8 x i32> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 1208 for 
instruction: %V16I32 = call { <16 x i32>, <16 x i1> } 
@llvm.smul.with.overflow.v16i32(<16 x i32> undef, <16 x i32> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: 
%I16 = call { i16, i1 } @llvm.smul.with.overflow.i16(i16 undef, i16 undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: 
%V8I16 = call { <8 x i16>, <8 x i1> } @llvm.smul.with.overflow.v8i16(<8 x i16> 
undef, <8 x i16> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 116 for instruction: 
%V16I16 = call { <16 x i16>, <16 x i1> } @llvm.smul.with.overflow.v16i16(<16 x 
i16> undef, <16 x i16> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 424 for instruction: 
%V32I16 = call { <32 x i16>, <32 x i1> } @llvm.smul.with.overflow.v32i16(<32 x 
i16> undef, <32 x i16> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 62 for instruction: 
%V8I16 = call { <8 x i16>, <8 x i1> } @llvm.smul.with.overflow.v8i16(<8 x i16> 
undef, <8 x i16> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 164 for instruction: 
%V16I16 = call { <16 x i16>, <16 x i1> } @llvm.smul.with.overflow.v16i16(<16 x 
i16> undef, <16 x i16> undef)
+; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 488 for instruction: 
%V32I16 = call { <32 x i16>, <32 x i1> } @llvm.smul.with.overflow.v32i16(<32 x 
i16> undef, <32 x i16> undef)
 ; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: 
%I8 = call { i8, i1 } @llvm.smul.with.overflow.i8(i8 undef, i8 undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: 
%V16I8 = call { <16 x i8>, <16 x i1> } @llvm.smul.with.overflow.v16i8(<16 x i8> 
undef, <16 x i8> undef)
-; MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 116 for instruction: 
%V32I8 = call { <32 x i8>, <32 x i1> } @llvm.smul.with.overflow.v32i8(<32 x i8> 
undef, 

[llvm-branch-commits] [llvm] 0c8b748 - [ARM] Additional trunc cost tests. NFC

2021-01-11 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-11T08:35:16Z
New Revision: 0c8b748f321736d016da0f6d710778f503a89b51

URL: 
https://github.com/llvm/llvm-project/commit/0c8b748f321736d016da0f6d710778f503a89b51
DIFF: 
https://github.com/llvm/llvm-project/commit/0c8b748f321736d016da0f6d710778f503a89b51.diff

LOG: [ARM] Additional trunc cost tests. NFC

Added: 


Modified: 
llvm/test/Analysis/CostModel/ARM/cast.ll

Removed: 




diff  --git a/llvm/test/Analysis/CostModel/ARM/cast.ll 
b/llvm/test/Analysis/CostModel/ARM/cast.ll
index b539dae1585e..3dc55674a131 100644
--- a/llvm/test/Analysis/CostModel/ARM/cast.ll
+++ b/llvm/test/Analysis/CostModel/ARM/cast.ll
@@ -124,8 +124,15 @@ define i32 @casts() {
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %rext_9 = zext <2 x i16> undef to <2 x i64>
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %rext_a = sext <2 x i32> undef to <2 x i64>
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %rext_b = zext <2 x i32> undef to <2 x i64>
-; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 3 for 
instruction: %r74 = trunc <8 x i32> undef to <8 x i8>
-; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 6 for 
instruction: %r75 = trunc <16 x i32> undef to <16 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %tv4i32i8 = trunc <4 x i32> undef to <4 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %tv4i32i16 = trunc <4 x i32> undef to <4 x i16>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 0 for 
instruction: %tv4i16i8 = trunc <4 x i16> undef to <4 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 3 for 
instruction: %tv8i32i8 = trunc <8 x i32> undef to <8 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 3 for 
instruction: %tv8i32i16 = trunc <8 x i32> undef to <8 x i16>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %tv8i16i8 = trunc <8 x i16> undef to <8 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 6 for 
instruction: %tv16i32i8 = trunc <16 x i32> undef to <16 x i8>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 6 for 
instruction: %tv16i32i16 = trunc <16 x i32> undef to <16 x i16>
+; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 3 for 
instruction: %tv16i16i8 = trunc <16 x i16> undef to <16 x i8>
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 1 for 
instruction: %r80df = fptrunc double undef to float
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %r81df = fptrunc <2 x double> undef to <2 x float>
 ; CHECK-NEON-RECIP-NEXT:  Cost Model: Found an estimated cost of 4 for 
instruction: %r82df = fptrunc <4 x double> undef to <4 x float>
@@ -511,8 +518,15 @@ define i32 @casts() {
 ; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 4 for 
instruction: %rext_9 = zext <2 x i16> undef to <2 x i64>
 ; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 16 for 
instruction: %rext_a = sext <2 x i32> undef to <2 x i64>
 ; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 4 for 
instruction: %rext_b = zext <2 x i32> undef to <2 x i64>
-; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %r74 = trunc <8 x i32> undef to <8 x i8>
-; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for 
instruction: %r75 = trunc <16 x i32> undef to <16 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 0 for 
instruction: %tv4i32i8 = trunc <4 x i32> undef to <4 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 0 for 
instruction: %tv4i32i16 = trunc <4 x i32> undef to <4 x i16>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 0 for 
instruction: %tv4i16i8 = trunc <4 x i16> undef to <4 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %tv8i32i8 = trunc <8 x i32> undef to <8 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %tv8i32i16 = trunc <8 x i32> undef to <8 x i16>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 0 for 
instruction: %tv8i16i8 = trunc <8 x i16> undef to <8 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for 
instruction: %tv16i32i8 = trunc <16 x i32> undef to <16 x i8>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 8 for 
instruction: %tv16i32i16 = trunc <16 x i32> undef to <16 x i16>
+; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 2 for 
instruction: %tv16i16i8 = trunc <16 x i16> undef to <16 x i8>
 ; CHECK-MVE-RECIP-NEXT:  Cost Model: Found an estimated cost of 10 for 
instruction: 

[llvm-branch-commits] [llvm] 024af42 - [ARM] Custom lower i1 vector truncates

2021-01-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-08T18:21:00Z
New Revision: 024af42c601063e5f831b3049612321b5629e00a

URL: 
https://github.com/llvm/llvm-project/commit/024af42c601063e5f831b3049612321b5629e00a
DIFF: 
https://github.com/llvm/llvm-project/commit/024af42c601063e5f831b3049612321b5629e00a.diff

LOG: [ARM] Custom lower i1 vector truncates

The ISel patterns we have for truncating to i1s under MVE do not seem
to be correct. Instead, custom lower to icmp(ne, and(x, 1), 0).

Differential Revision: https://reviews.llvm.org/D94226
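
As a minimal sketch (ours, not from the patch) of the new lowering, a
truncate such as trunc <4 x i32> %x to <4 x i1> is now emitted as a compare
of the masked low bit:

  %and = and <4 x i32> %x, <i32 1, i32 1, i32 1, i32 1>
  %cmp = icmp ne <4 x i32> %and, zeroinitializer

Only bit 0 of each lane decides the predicate, which is exactly the
semantics of a truncate to i1.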

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/lib/Target/ARM/ARMInstrMVE.td
llvm/test/CodeGen/Thumb2/mve-pred-ext.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index efe2efe91bcf..982397dbb2db 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -443,6 +443,7 @@ void ARMTargetLowering::addMVEVectorTypes(bool HasMVEFP) {
 setOperationAction(ISD::SCALAR_TO_VECTOR, VT, Expand);
 setOperationAction(ISD::LOAD, VT, Custom);
 setOperationAction(ISD::STORE, VT, Custom);
+setOperationAction(ISD::TRUNCATE, VT, Custom);
   }
 }
 
@@ -8660,6 +8661,23 @@ static SDValue LowerEXTRACT_SUBVECTOR(SDValue Op, SelectionDAG &DAG,
  DAG.getConstant(ARMCC::NE, dl, MVT::i32));
 }
 
+// Turn a truncate into a predicate (an i1 vector) into icmp(and(x, 1), 0).
+static SDValue LowerTruncatei1(SDValue N, SelectionDAG &DAG,
+   const ARMSubtarget *ST) {
+  assert(ST->hasMVEIntegerOps() && "Expected MVE!");
+  EVT VT = N.getValueType();
+  assert((VT == MVT::v16i1 || VT == MVT::v8i1 || VT == MVT::v4i1) &&
+ "Expected a vector i1 type!");
+  SDValue Op = N.getOperand(0);
+  EVT FromVT = Op.getValueType();
+  SDLoc DL(N);
+
+  SDValue And =
+  DAG.getNode(ISD::AND, DL, FromVT, Op, DAG.getConstant(1, DL, FromVT));
+  return DAG.getNode(ISD::SETCC, DL, VT, And, DAG.getConstant(0, DL, FromVT),
+ DAG.getCondCode(ISD::SETNE));
+}
+
 /// isExtendedBUILD_VECTOR - Check if N is a constant BUILD_VECTOR where each
 /// element has been zero/sign-extended, depending on the isSigned parameter,
 /// from an integer type half its size.
@@ -9771,6 +9789,7 @@ SDValue ARMTargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
   case ISD::INSERT_VECTOR_ELT: return LowerINSERT_VECTOR_ELT(Op, DAG);
   case ISD::EXTRACT_VECTOR_ELT: return LowerEXTRACT_VECTOR_ELT(Op, DAG, 
Subtarget);
   case ISD::CONCAT_VECTORS: return LowerCONCAT_VECTORS(Op, DAG, Subtarget);
+  case ISD::TRUNCATE:  return LowerTruncatei1(Op, DAG, Subtarget);
   case ISD::FLT_ROUNDS_:   return LowerFLT_ROUNDS_(Op, DAG);
   case ISD::MUL:   return LowerMUL(Op, DAG);
   case ISD::SDIV:

diff  --git a/llvm/lib/Target/ARM/ARMInstrMVE.td 
b/llvm/lib/Target/ARM/ARMInstrMVE.td
index b4e4397b44c9..0dfea68887e5 100644
--- a/llvm/lib/Target/ARM/ARMInstrMVE.td
+++ b/llvm/lib/Target/ARM/ARMInstrMVE.td
@@ -6759,13 +6759,6 @@ let Predicates = [HasMVEInt] in {
 (v8i16 (MVE_VPSEL (MVE_VMOVimmi16 1), (MVE_VMOVimmi16 0), 
ARMVCCNone, VCCR:$pred))>;
   def : Pat<(v4i32 (anyext  (v4i1  VCCR:$pred))),
 (v4i32 (MVE_VPSEL (MVE_VMOVimmi32 1), (MVE_VMOVimmi32 0), 
ARMVCCNone, VCCR:$pred))>;
-
-  def : Pat<(v16i1 (trunc (v16i8 MQPR:$v1))),
-(v16i1 (MVE_VCMPi32r (v16i8 MQPR:$v1), ZR, ARMCCne))>;
-  def : Pat<(v8i1 (trunc (v8i16  MQPR:$v1))),
-(v8i1 (MVE_VCMPi32r (v8i16 MQPR:$v1), ZR, ARMCCne))>;
-  def : Pat<(v4i1 (trunc (v4i32  MQPR:$v1))),
-(v4i1 (MVE_VCMPi32r (v4i32 MQPR:$v1), ZR, ARMCCne))>;
 }
 
 let Predicates = [HasMVEFloat] in {

diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-ext.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-ext.ll
index c280fa2ed658..9fe502a26bbc 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-ext.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-ext.ll
@@ -159,8 +159,10 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @trunc_v4i1_v4i32(<4 x i32> %src) {
 ; CHECK-LABEL: trunc_v4i1_v4i32:
 ; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.i32 q2, #0x1
 ; CHECK-NEXT:vmov.i32 q1, #0x0
-; CHECK-NEXT:vcmp.i32 ne, q0, zr
+; CHECK-NEXT:vand q2, q0, q2
+; CHECK-NEXT:vcmp.i32 ne, q2, zr
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
 entry:
@@ -172,8 +174,10 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @trunc_v8i1_v8i16(<8 x i16> %src) {
 ; CHECK-LABEL: trunc_v8i1_v8i16:
 ; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.i16 q2, #0x1
 ; CHECK-NEXT:vmov.i32 q1, #0x0
-; CHECK-NEXT:vcmp.i32 ne, q0, zr
+; CHECK-NEXT:vand q2, q0, q2
+; CHECK-NEXT:vcmp.i16 ne, q2, zr
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
 entry:
@@ -185,8 +189,10 @@ entry:
 define arm_aapcs_vfpcc <16 x i8> @trunc_v16i1_v16i8(<16 x i8> %src) {
 ; CHECK-LABEL: 

[llvm-branch-commits] [llvm] e185b1d - [ConstProp] Constant propagation for get.active.lane.mask intrinsics

2021-01-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-08T16:10:01Z
New Revision: e185b1dd7b34c352167823295281f1bf1df09976

URL: 
https://github.com/llvm/llvm-project/commit/e185b1dd7b34c352167823295281f1bf1df09976
DIFF: 
https://github.com/llvm/llvm-project/commit/e185b1dd7b34c352167823295281f1bf1df09976.diff

LOG: [ConstProp] Constant propagation for get.active.lane.mask intrinsics

Similar to the Arm VCTP intrinsics, if the operands of an
active.lane.mask are both known, the constant lane mask can be
calculated. This can come up after loops have been unrolled.

Differential Revision: https://reviews.llvm.org/D94103
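
A small worked example (ours; the declaration mirrors the intrinsic naming
used in the new test below): with a base of 2 and a limit of 4, lane i is
active exactly when 2 + i < 4.

  declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)

  define <4 x i1> @fold_example() {
    ; constant-folds to <4 x i1> <i1 true, i1 true, i1 false, i1 false>
    %m = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 2, i32 4)
    ret <4 x i1> %m
  }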

Added: 
llvm/test/Transforms/InstSimplify/ConstProp/active-lane-mask.ll

Modified: 
llvm/lib/Analysis/ConstantFolding.cpp

Removed: 




diff  --git a/llvm/lib/Analysis/ConstantFolding.cpp 
b/llvm/lib/Analysis/ConstantFolding.cpp
index 7b0d4bd5172b..22b9acbc03b8 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -1456,6 +1456,7 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, 
const Function *F) {
   case Intrinsic::launder_invariant_group:
   case Intrinsic::strip_invariant_group:
   case Intrinsic::masked_load:
+  case Intrinsic::get_active_lane_mask:
   case Intrinsic::abs:
   case Intrinsic::smax:
   case Intrinsic::smin:
@@ -2927,6 +2928,25 @@ static Constant *ConstantFoldVectorCall(StringRef Name,
 }
 break;
   }
+  case Intrinsic::get_active_lane_mask: {
+auto *Op0 = dyn_cast<ConstantInt>(Operands[0]);
+auto *Op1 = dyn_cast<ConstantInt>(Operands[1]);
+if (Op0 && Op1) {
+  unsigned Lanes = FVTy->getNumElements();
+  uint64_t Base = Op0->getZExtValue();
+  uint64_t Limit = Op1->getZExtValue();
+
+  SmallVector<Constant *, 16> NCs;
+  for (unsigned i = 0; i < Lanes; i++) {
+if (Base + i < Limit)
+  NCs.push_back(ConstantInt::getTrue(Ty));
+else
+  NCs.push_back(ConstantInt::getFalse(Ty));
+  }
+  return ConstantVector::get(NCs);
+}
+break;
+  }
   default:
 break;
   }

diff  --git a/llvm/test/Transforms/InstSimplify/ConstProp/active-lane-mask.ll 
b/llvm/test/Transforms/InstSimplify/ConstProp/active-lane-mask.ll
new file mode 100644
index ..a6006bca169c
--- /dev/null
+++ b/llvm/test/Transforms/InstSimplify/ConstProp/active-lane-mask.ll
@@ -0,0 +1,300 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -instsimplify -S -o - %s | FileCheck %s
+
+target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
+
+define <16 x i1> @v16i1_0() {
+; CHECK-LABEL: @v16i1_0(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> zeroinitializer
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 0)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_1() {
+; CHECK-LABEL: @v16i1_1(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 1)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_8() {
+; CHECK-LABEL: @v16i1_8(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 8)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_15() {
+; CHECK-LABEL: @v16i1_15(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 false>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 15)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_16() {
+; CHECK-LABEL: @v16i1_16(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 16)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_100() {
+; CHECK-LABEL: @v16i1_100(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 100)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_m1() {
+; CHECK-LABEL: @v16i1_m1(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 0, i32 -1)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_10_11() {
+; CHECK-LABEL: @v16i1_10_11(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 false>
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 10, i32 11)
+  ret <16 x i1> %int
+}
+
+define <16 x i1> @v16i1_12_11() {
+; CHECK-LABEL: @v16i1_12_11(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <16 x i1> zeroinitializer
+;
+entry:
+  %int = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 12, i32 11)
+  ret <16 x i1> %int
+}
+
+
+
+define <8 x i1> @v8i1_0() {
+; CHECK-LABEL: @v8i1_0(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:ret <8 x i1> zeroinitializer
+;
+entry:
+  %int = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 0, i32 0)
+  ret <8 x i1> %int
+}
+
+define <8 x i1> @v8i1_1() {
+; CHECK-LABEL: 

[llvm-branch-commits] [llvm] a36a286 - [ARM][LV] Additional loop invariant reduction test. NFC

2021-01-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-08T15:15:08Z
New Revision: a36a2864c0d4b89b66e0cdfde0f82d569a293e10

URL: 
https://github.com/llvm/llvm-project/commit/a36a2864c0d4b89b66e0cdfde0f82d569a293e10
DIFF: 
https://github.com/llvm/llvm-project/commit/a36a2864c0d4b89b66e0cdfde0f82d569a293e10.diff

LOG: [ARM][LV] Additional loop invariant reduction test. NFC

Added: 


Modified: 
llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll

Removed: 




diff  --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll 
b/llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll
index e3ae35e91159..5b97fef2bdcc 100644
--- a/llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll
+++ b/llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll
@@ -1045,6 +1045,57 @@ for.cond.cleanup: ; 
preds = %for.body, %entry
   ret float %r.0.lcssa
 }
 
+define i64 @loopinvariant_mla(i32* nocapture readonly %x, i32 %y, i32 %n) #0 {
+; CHECK-LABEL: @loopinvariant_mla(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:[[CMP7:%.*]] = icmp sgt i32 [[N:%.*]], 0
+; CHECK-NEXT:br i1 [[CMP7]], label [[FOR_BODY_LR_PH:%.*]], label 
[[FOR_COND_CLEANUP:%.*]]
+; CHECK:   for.body.lr.ph:
+; CHECK-NEXT:[[CONV1:%.*]] = sext i32 [[Y:%.*]] to i64
+; CHECK-NEXT:br label [[FOR_BODY:%.*]]
+; CHECK:   for.cond.cleanup.loopexit:
+; CHECK-NEXT:[[ADD_LCSSA:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:br label [[FOR_COND_CLEANUP]]
+; CHECK:   for.cond.cleanup:
+; CHECK-NEXT:[[S_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ 
[[ADD_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT:%.*]] ]
+; CHECK-NEXT:ret i64 [[S_0_LCSSA]]
+; CHECK:   for.body:
+; CHECK-NEXT:[[I_09:%.*]] = phi i32 [ 0, [[FOR_BODY_LR_PH]] ], [ 
[[INC:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT:[[S_08:%.*]] = phi i64 [ 0, [[FOR_BODY_LR_PH]] ], [ [[ADD]], 
[[FOR_BODY]] ]
+; CHECK-NEXT:[[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* 
[[X:%.*]], i32 [[I_09]]
+; CHECK-NEXT:[[TMP0:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
+; CHECK-NEXT:[[CONV:%.*]] = sext i32 [[TMP0]] to i64
+; CHECK-NEXT:[[MUL:%.*]] = mul nsw i64 [[CONV]], [[CONV1]]
+; CHECK-NEXT:[[ADD]] = add nsw i64 [[MUL]], [[S_08]]
+; CHECK-NEXT:[[INC]] = add nuw nsw i32 [[I_09]], 1
+; CHECK-NEXT:[[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
+; CHECK-NEXT:br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP_LOOPEXIT]], 
label [[FOR_BODY]]
+;
+entry:
+  %cmp7 = icmp sgt i32 %n, 0
+  br i1 %cmp7, label %for.body.lr.ph, label %for.cond.cleanup
+
+for.body.lr.ph:   ; preds = %entry
+  %conv1 = sext i32 %y to i64
+  br label %for.body
+
+for.cond.cleanup: ; preds = %for.body, %entry
+  %s.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
+  ret i64 %s.0.lcssa
+
+for.body: ; preds = %for.body.lr.ph, 
%for.body
+  %i.09 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
+  %s.08 = phi i64 [ 0, %for.body.lr.ph ], [ %add, %for.body ]
+  %arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.09
+  %0 = load i32, i32* %arrayidx, align 4
+  %conv = sext i32 %0 to i64
+  %mul = mul nsw i64 %conv, %conv1
+  %add = add nsw i64 %mul, %s.08
+  %inc = add nuw nsw i32 %i.09, 1
+  %exitcond.not = icmp eq i32 %inc, %n
+  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
+}
+
 attributes #0 = { "target-features"="+mve.fp" }
 !6 = distinct !{!6, !7}
 !7 = !{!"llvm.loop.vectorize.width", i32 16}



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] 1ae7624 - [ARM] Update and regenerate test checks. NFC

2021-01-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-08T14:54:16Z
New Revision: 1ae762469fd11be0b5a10353281a8264ab97b166

URL: 
https://github.com/llvm/llvm-project/commit/1ae762469fd11be0b5a10353281a8264ab97b166
DIFF: 
https://github.com/llvm/llvm-project/commit/1ae762469fd11be0b5a10353281a8264ab97b166.diff

LOG: [ARM] Update and regenerate test checks. NFC

Added: 


Modified: 
llvm/test/CodeGen/ARM/arm-shrink-wrapping.ll
llvm/test/CodeGen/ARM/indexed-mem.ll

Removed: 




diff  --git a/llvm/test/CodeGen/ARM/arm-shrink-wrapping.ll 
b/llvm/test/CodeGen/ARM/arm-shrink-wrapping.ll
index e6fc02970e4f..b5c63af5a348 100644
--- a/llvm/test/CodeGen/ARM/arm-shrink-wrapping.ll
+++ b/llvm/test/CodeGen/ARM/arm-shrink-wrapping.ll
@@ -1760,102 +1760,199 @@ declare double @llvm.pow.f64(double, double)
 ;
 ; bl
 define float @debug_info(float %gamma, float %slopeLimit, i1 %or.cond, double 
%tmp) "frame-pointer"="all" {
-; ARM-LABEL: debug_info:
-; ARM:   @ %bb.0: @ %bb
-; ARM-NEXT:push {r4, r7, lr}
-; ARM-NEXT:add r7, sp, #4
-; ARM-NEXT:sub r4, sp, #16
-; ARM-NEXT:bfc r4, #0, #4
-; ARM-NEXT:mov sp, r4
-; ARM-NEXT:tst r2, #1
-; ARM-NEXT:vst1.64 {d8, d9}, [r4:128]
-; ARM-NEXT:beq LBB12_2
-; ARM-NEXT:  @ %bb.1: @ %bb3
-; ARM-NEXT:ldr r1, [r7, #8]
-; ARM-NEXT:vmov s16, r0
-; ARM-NEXT:mov r0, r3
-; ARM-NEXT:mov r2, r3
-; ARM-NEXT:vmov d9, r3, r1
-; ARM-NEXT:mov r3, r1
-; ARM-NEXT:bl _pow
-; ARM-NEXT:vmov.f32 s0, #1.00e+00
-; ARM-NEXT:vmov.f64 d16, #1.00e+00
-; ARM-NEXT:vadd.f64 d16, d9, d16
-; ARM-NEXT:vcmp.f32 s16, s0
-; ARM-NEXT:vmrs APSR_nzcv, fpscr
-; ARM-NEXT:vmov d17, r0, r1
-; ARM-NEXT:vmov.f64 d18, d9
-; ARM-NEXT:vadd.f64 d17, d17, d17
-; ARM-NEXT:vmovgt.f64 d18, d16
-; ARM-NEXT:vcmp.f64 d18, d9
-; ARM-NEXT:vmrs APSR_nzcv, fpscr
-; ARM-NEXT:vmovne.f64 d9, d17
-; ARM-NEXT:vcvt.f32.f64 s0, d9
-; ARM-NEXT:b LBB12_3
-; ARM-NEXT:  LBB12_2:
-; ARM-NEXT:vldr s0, LCPI12_0
-; ARM-NEXT:  LBB12_3: @ %bb13
-; ARM-NEXT:mov r4, sp
-; ARM-NEXT:vld1.64 {d8, d9}, [r4:128]
-; ARM-NEXT:vmov r0, s0
-; ARM-NEXT:sub sp, r7, #4
-; ARM-NEXT:pop {r4, r7, pc}
-; ARM-NEXT:.p2align 2
-; ARM-NEXT:  @ %bb.4:
-; ARM-NEXT:.data_region
-; ARM-NEXT:  LCPI12_0:
-; ARM-NEXT:.long 0 @ float 0
-; ARM-NEXT:.end_data_region
-;
-; THUMB-LABEL: debug_info:
-; THUMB:   @ %bb.0: @ %bb
-; THUMB-NEXT:push {r4, r7, lr}
-; THUMB-NEXT:add r7, sp, #4
-; THUMB-NEXT:sub.w r4, sp, #16
-; THUMB-NEXT:bfc r4, #0, #4
-; THUMB-NEXT:mov sp, r4
-; THUMB-NEXT:lsls r1, r2, #31
-; THUMB-NEXT:vst1.64 {d8, d9}, [r4:128]
-; THUMB-NEXT:beq LBB12_2
-; THUMB-NEXT:  @ %bb.1: @ %bb3
-; THUMB-NEXT:ldr r1, [r7, #8]
-; THUMB-NEXT:vmov s16, r0
-; THUMB-NEXT:mov r0, r3
-; THUMB-NEXT:mov r2, r3
-; THUMB-NEXT:vmov d9, r3, r1
-; THUMB-NEXT:mov r3, r1
-; THUMB-NEXT:bl _pow
-; THUMB-NEXT:vmov.f32 s0, #1.00e+00
-; THUMB-NEXT:vmov.f64 d16, #1.00e+00
-; THUMB-NEXT:vmov.f64 d18, d9
-; THUMB-NEXT:vcmp.f32 s16, s0
-; THUMB-NEXT:vadd.f64 d16, d9, d16
-; THUMB-NEXT:vmrs APSR_nzcv, fpscr
-; THUMB-NEXT:it gt
-; THUMB-NEXT:vmovgt.f64 d18, d16
-; THUMB-NEXT:vcmp.f64 d18, d9
-; THUMB-NEXT:vmov d17, r0, r1
-; THUMB-NEXT:vmrs APSR_nzcv, fpscr
-; THUMB-NEXT:vadd.f64 d17, d17, d17
-; THUMB-NEXT:it ne
-; THUMB-NEXT:vmovne.f64 d9, d17
-; THUMB-NEXT:vcvt.f32.f64 s0, d9
-; THUMB-NEXT:b LBB12_3
-; THUMB-NEXT:  LBB12_2:
-; THUMB-NEXT:vldr s0, LCPI12_0
-; THUMB-NEXT:  LBB12_3: @ %bb13
-; THUMB-NEXT:mov r4, sp
-; THUMB-NEXT:vld1.64 {d8, d9}, [r4:128]
-; THUMB-NEXT:subs r4, r7, #4
-; THUMB-NEXT:vmov r0, s0
-; THUMB-NEXT:mov sp, r4
-; THUMB-NEXT:pop {r4, r7, pc}
-; THUMB-NEXT:.p2align 2
-; THUMB-NEXT:  @ %bb.4:
-; THUMB-NEXT:.data_region
-; THUMB-NEXT:  LCPI12_0:
-; THUMB-NEXT:.long 0 @ float 0
-; THUMB-NEXT:.end_data_region
+; ARM-ENABLE-LABEL: debug_info:
+; ARM-ENABLE:   @ %bb.0: @ %bb
+; ARM-ENABLE-NEXT:push {r4, r7, lr}
+; ARM-ENABLE-NEXT:add r7, sp, #4
+; ARM-ENABLE-NEXT:sub r4, sp, #16
+; ARM-ENABLE-NEXT:bfc r4, #0, #4
+; ARM-ENABLE-NEXT:mov sp, r4
+; ARM-ENABLE-NEXT:tst r2, #1
+; ARM-ENABLE-NEXT:vst1.64 {d8, d9}, [r4:128]
+; ARM-ENABLE-NEXT:beq LBB12_2
+; ARM-ENABLE-NEXT:  @ %bb.1: @ %bb3
+; ARM-ENABLE-NEXT:ldr r1, [r7, #8]
+; ARM-ENABLE-NEXT:vmov s16, r0
+; ARM-ENABLE-NEXT:mov r0, r3
+; ARM-ENABLE-NEXT:mov r2, r3
+; ARM-ENABLE-NEXT:vmov d9, r3, r1
+; ARM-ENABLE-NEXT:mov r3, r1
+; ARM-ENABLE-NEXT:bl _pow
+; ARM-ENABLE-NEXT:vmov.f32 s0, #1.00e+00
+; ARM-ENABLE-NEXT:vmov.f64 d16, #1.00e+00
+; ARM-ENABLE-NEXT:vadd.f64 d16, d9, d16
+; ARM-ENABLE-NEXT:vcmp.f32 s16, s0
+; ARM-ENABLE-NEXT:vmrs APSR_nzcv, fpscr
+; 

[llvm-branch-commits] [llvm] 72fb5ba - [LV] Don't sink into replication regions

2021-01-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-08T09:50:10Z
New Revision: 72fb5ba079019c2108d676526b5285b228795e48

URL: 
https://github.com/llvm/llvm-project/commit/72fb5ba079019c2108d676526b5285b228795e48
DIFF: 
https://github.com/llvm/llvm-project/commit/72fb5ba079019c2108d676526b5285b228795e48.diff

LOG: [LV] Don't sink into replication regions

The new test case here contains a first-order recurrence and an
instruction that is replicated. The first-order recurrence forces an
instruction to be sunk _into_ the replication region, as opposed to
after it. That causes several things to go wrong, including registering
vector instructions multiple times and failing to create dominance
relations correctly.

Instead we should be sinking to after the replication region, which is
what this patch ensures.

Differential Revision: https://reviews.llvm.org/D93629

Added: 


Modified: 
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/lib/Transforms/Vectorize/VPlan.cpp
llvm/lib/Transforms/Vectorize/VPlan.h
llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
llvm/unittests/Transforms/Vectorize/VPlanTest.cpp

Removed: 




diff  --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp 
b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 1518b757186d..0b58ad1ff2ea 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8507,6 +8507,18 @@ VPlanPtr 
LoopVectorizationPlanner::buildVPlanWithVPRecipes(
   for (auto &Entry : SinkAfter) {
 VPRecipeBase *Sink = RecipeBuilder.getRecipe(Entry.first);
 VPRecipeBase *Target = RecipeBuilder.getRecipe(Entry.second);
+// If the target is in a replication region, make sure to move Sink to the
+// block after it, not into the replication region itself.
+if (auto *Region =
+dyn_cast_or_null<VPRegionBlock>(Target->getParent()->getParent())) {
+  if (Region->isReplicator()) {
+assert(Region->getNumSuccessors() == 1 && "Expected SESE region!");
+VPBasicBlock *NextBlock =
+cast<VPBasicBlock>(Region->getSuccessors().front());
+Sink->moveBefore(*NextBlock, NextBlock->getFirstNonPhi());
+continue;
+  }
+}
 Sink->moveAfter(Target);
   }
 

diff  --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp 
b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index c6e44d11e7b3..bca6d73dc44b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -433,6 +433,14 @@ void VPRecipeBase::moveAfter(VPRecipeBase *InsertPos) {
   insertAfter(InsertPos);
 }
 
+void VPRecipeBase::moveBefore(VPBasicBlock &BB,
+  iplist<VPRecipeBase>::iterator I) {
+  assert(I == BB.end() || I->getParent() == &BB);
+  removeFromParent();
+  Parent = &BB;
+  BB.getRecipeList().insert(I, this);
+}
+
 void VPInstruction::generateInstruction(VPTransformState &State,
 unsigned Part) {
   IRBuilder<> &Builder = State.Builder;

diff  --git a/llvm/lib/Transforms/Vectorize/VPlan.h 
b/llvm/lib/Transforms/Vectorize/VPlan.h
index dcc7d3db9b97..1926c9255a58 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -664,6 +664,11 @@ class VPRecipeBase : public ilist_node_with_parent<VPRecipeBase, VPBasicBlock>,
   /// the VPBasicBlock that MovePos lives in, right after MovePos.
   void moveAfter(VPRecipeBase *MovePos);
 
+  /// Unlink this recipe and insert into BB before I.
+  ///
+  /// \pre I is a valid iterator into BB.
+  void moveBefore(VPBasicBlock &BB, iplist<VPRecipeBase>::iterator I);
+
   /// This method unlinks 'this' from the containing basic block, but does not
   /// delete it.
   void removeFromParent();

diff  --git a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll 
b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
index 242402d25666..ce2d2adcce99 100644
--- a/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
@@ -645,3 +645,235 @@ for.cond:
 for.end:
   ret void
 }
+
+define i32 @sink_into_replication_region(i32 %y) {
+; CHECK-LABEL: @sink_into_replication_region(
+; CHECK-NEXT:  bb:
+; CHECK-NEXT:[[TMP0:%.*]] = icmp sgt i32 [[Y:%.*]], 1
+; CHECK-NEXT:[[TMP1:%.*]] = select i1 [[TMP0]], i32 [[Y]], i32 1
+; CHECK-NEXT:br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK:   vector.ph:
+; CHECK-NEXT:[[N_RND_UP:%.*]] = add nuw i32 [[TMP1]], 3
+; CHECK-NEXT:[[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
+; CHECK-NEXT:[[TRIP_COUNT_MINUS_1:%.*]] = add nsw i32 [[TMP1]], -1
+; CHECK-NEXT:[[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> 
poison, i32 [[TRIP_COUNT_MINUS_1]], i32 0
+; CHECK-NEXT:[[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> 
[[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
+; CHECK-NEXT:br label [[VECTOR_BODY:%.*]]
+; CHECK:   vector.body:
+; 

[llvm-branch-commits] [llvm] 63dce70 - [ARM] Handle any extend whilst lowering addw/addl/subw/subl

2021-01-06 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-06T11:26:39Z
New Revision: 63dce70b794eb99ebbfdeed3ca9aafca2b8fe5c4

URL: 
https://github.com/llvm/llvm-project/commit/63dce70b794eb99ebbfdeed3ca9aafca2b8fe5c4
DIFF: 
https://github.com/llvm/llvm-project/commit/63dce70b794eb99ebbfdeed3ca9aafca2b8fe5c4.diff

LOG: [ARM] Handle any extend whilst lowering addw/addl/subw/subl

Same as a9b6440edd, use zanyext to treat any_extends as zero extends
during lowering to create addw/addl/subw/subl nodes.

Differential Revision: https://reviews.llvm.org/D93835
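
For illustration (our sketch, not part of the commit): in IR like the
following, the and only demands the low 8 bits of the sum, so demanded-bits
analysis may relax the two zexts to any_extends; with zanyext the add still
selects to a single vaddl.u8 plus the mask.

  define <8 x i16> @vaddl_sketch(<8 x i8> %a, <8 x i8> %b) {
    %ea = zext <8 x i8> %a to <8 x i16>   ; may become any_extend
    %eb = zext <8 x i8> %b to <8 x i16>   ; may become any_extend
    %s = add <8 x i16> %ea, %eb
    %m = and <8 x i16> %s, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
    ret <8 x i16> %m
  }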

Added: 


Modified: 
llvm/lib/Target/ARM/ARMInstrNEON.td
llvm/test/CodeGen/ARM/vadd.ll
llvm/test/CodeGen/ARM/vsub.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMInstrNEON.td 
b/llvm/lib/Target/ARM/ARMInstrNEON.td
index bb30dbd3a5c9..a8c0d05d91c4 100644
--- a/llvm/lib/Target/ARM/ARMInstrNEON.td
+++ b/llvm/lib/Target/ARM/ARMInstrNEON.td
@@ -4197,10 +4197,10 @@ def  VADDhq   : N3VQ<0, 0, 0b01, 0b1101, 0, IIC_VBINQ, 
"vadd", "f16",
 defm VADDLs   : N3VLExt_QHS<0,1,0b,0, IIC_VSHLiD, IIC_VSHLiD,
 "vaddl", "s", add, sext, 1>;
 defm VADDLu   : N3VLExt_QHS<1,1,0b,0, IIC_VSHLiD, IIC_VSHLiD,
-"vaddl", "u", add, zext, 1>;
+"vaddl", "u", add, zanyext, 1>;
 //   VADDW: Vector Add Wide (Q = Q + D)
 defm VADDWs   : N3VW_QHS<0,1,0b0001,0, "vaddw", "s", add, sext, 0>;
-defm VADDWu   : N3VW_QHS<1,1,0b0001,0, "vaddw", "u", add, zext, 0>;
+defm VADDWu   : N3VW_QHS<1,1,0b0001,0, "vaddw", "u", add, zanyext, 0>;
 //   VHADD: Vector Halving Add
 defm VHADDs   : N3VInt_QHS<0, 0, 0b, 0, N3RegFrm,
IIC_VBINi4D, IIC_VBINi4D, IIC_VBINi4Q, IIC_VBINi4Q,
@@ -5045,10 +5045,10 @@ def  VSUBhq   : N3VQ<0, 0, 0b11, 0b1101, 0, IIC_VBINQ, 
"vsub", "f16",
 defm VSUBLs   : N3VLExt_QHS<0,1,0b0010,0, IIC_VSHLiD, IIC_VSHLiD,
 "vsubl", "s", sub, sext, 0>;
 defm VSUBLu   : N3VLExt_QHS<1,1,0b0010,0, IIC_VSHLiD, IIC_VSHLiD,
-"vsubl", "u", sub, zext, 0>;
+"vsubl", "u", sub, zanyext, 0>;
 //   VSUBW: Vector Subtract Wide (Q = Q - D)
 defm VSUBWs   : N3VW_QHS<0,1,0b0011,0, "vsubw", "s", sub, sext, 0>;
-defm VSUBWu   : N3VW_QHS<1,1,0b0011,0, "vsubw", "u", sub, zext, 0>;
+defm VSUBWu   : N3VW_QHS<1,1,0b0011,0, "vsubw", "u", sub, zanyext, 0>;
 //   VHSUB: Vector Halving Subtract
 defm VHSUBs   : N3VInt_QHS<0, 0, 0b0010, 0, N3RegFrm,
IIC_VSUBi4D, IIC_VSUBi4D, IIC_VSUBi4Q, IIC_VSUBi4Q,

diff  --git a/llvm/test/CodeGen/ARM/vadd.ll b/llvm/test/CodeGen/ARM/vadd.ll
index 5f0ddd17c8c7..282108244e5c 100644
--- a/llvm/test/CodeGen/ARM/vadd.ll
+++ b/llvm/test/CodeGen/ARM/vadd.ll
@@ -224,9 +224,7 @@ define <2 x i64> @vaddlu32(<2 x i32> %A, <2 x i32> %B) {
 define <8 x i16> @vaddla8(<8 x i8> %A, <8 x i8> %B) {
 ; CHECK-LABEL: vaddla8:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u8 q8, d1
-; CHECK-NEXT:vmovl.u8 q9, d0
-; CHECK-NEXT:vadd.i16 q0, q9, q8
+; CHECK-NEXT:vaddl.u8 q0, d0, d1
 ; CHECK-NEXT:vbic.i16 q0, #0xff00
 ; CHECK-NEXT:bx lr
   %tmp3 = zext <8 x i8> %A to <8 x i16>
@@ -239,11 +237,9 @@ define <8 x i16> @vaddla8(<8 x i8> %A, <8 x i8> %B) {
 define <4 x i32> @vaddla16(<4 x i16> %A, <4 x i16> %B) {
 ; CHECK-LABEL: vaddla16:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u16 q8, d1
-; CHECK-NEXT:vmovl.u16 q9, d0
-; CHECK-NEXT:vmov.i32 q10, #0x
-; CHECK-NEXT:vadd.i32 q8, q9, q8
-; CHECK-NEXT:vand q0, q8, q10
+; CHECK-NEXT:vmov.i32 q8, #0x
+; CHECK-NEXT:vaddl.u16 q9, d0, d1
+; CHECK-NEXT:vand q0, q9, q8
 ; CHECK-NEXT:bx lr
   %tmp3 = zext <4 x i16> %A to <4 x i32>
   %tmp4 = zext <4 x i16> %B to <4 x i32>
@@ -255,11 +251,9 @@ define <4 x i32> @vaddla16(<4 x i16> %A, <4 x i16> %B) {
 define <2 x i64> @vaddla32(<2 x i32> %A, <2 x i32> %B) {
 ; CHECK-LABEL: vaddla32:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u32 q8, d1
-; CHECK-NEXT:vmovl.u32 q9, d0
-; CHECK-NEXT:vmov.i64 q10, #0x
-; CHECK-NEXT:vadd.i64 q8, q9, q8
-; CHECK-NEXT:vand q0, q8, q10
+; CHECK-NEXT:vmov.i64 q8, #0x
+; CHECK-NEXT:vaddl.u32 q9, d0, d1
+; CHECK-NEXT:vand q0, q9, q8
 ; CHECK-NEXT:bx lr
   %tmp3 = zext <2 x i32> %A to <2 x i64>
   %tmp4 = zext <2 x i32> %B to <2 x i64>
@@ -331,8 +325,7 @@ define <2 x i64> @vaddwu32(<2 x i64> %A, <2 x i32> %B) {
 define <8 x i16> @vaddwa8(<8 x i16> %A, <8 x i8> %B) {
 ; CHECK-LABEL: vaddwa8:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u8 q8, d2
-; CHECK-NEXT:vadd.i16 q0, q0, q8
+; CHECK-NEXT:vaddw.u8 q0, q0, d2
 ; CHECK-NEXT:vbic.i16 q0, #0xff00
 ; CHECK-NEXT:bx lr
   %tmp3 = zext <8 x i8> %B to <8 x i16>
@@ -344,10 +337,9 @@ define <8 x i16> @vaddwa8(<8 x i16> %A, <8 x i8> %B) {
 define <4 x i32> @vaddwa16(<4 x i32> %A, <4 x i16> %B) {
 ; 

[llvm-branch-commits] [llvm] ddb82fc - [ARM] Handle any extend whilst lowering mull

2021-01-06 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-06T10:51:12Z
New Revision: ddb82fc76ceb92e6f361d35f1ee8dedaee756854

URL: 
https://github.com/llvm/llvm-project/commit/ddb82fc76ceb92e6f361d35f1ee8dedaee756854
DIFF: 
https://github.com/llvm/llvm-project/commit/ddb82fc76ceb92e6f361d35f1ee8dedaee756854.diff

LOG: [ARM] Handle any extend whilst lowering mull

Similar to 78d8a821e23e but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.

Differential Revision: https://reviews.llvm.org/D93834
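
A companion sketch (ours) for the multiply side: when only the low bits of
the product are used, treating a relaxed any_extend as a zero extend is
safe, because the low 8 bits of the product depend only on the low 8 bits
of the operands, so a single vmull.u8 can still be formed.

  define <8 x i16> @vmull_sketch(<8 x i8> %a, <8 x i8> %b) {
    %ea = zext <8 x i8> %a to <8 x i16>   ; may become any_extend
    %eb = zext <8 x i8> %b to <8 x i16>   ; may become any_extend
    %p = mul <8 x i16> %ea, %eb
    %m = and <8 x i16> %p, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
    ret <8 x i16> %m
  }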

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/ARM/vmla.ll
llvm/test/CodeGen/ARM/vmls.ll
llvm/test/CodeGen/ARM/vmul.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 6a8355f0c3e8..efe2efe91bcf 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -8724,10 +8724,11 @@ static bool isSignExtended(SDNode *N, SelectionDAG &DAG) {
   return false;
 }
 
-/// isZeroExtended - Check if a node is a vector value that is zero-extended
-/// or a constant BUILD_VECTOR with zero-extended elements.
+/// isZeroExtended - Check if a node is a vector value that is zero-extended 
(or
+/// any-extended) or a constant BUILD_VECTOR with zero-extended elements.
 static bool isZeroExtended(SDNode *N, SelectionDAG &DAG) {
-  if (N->getOpcode() == ISD::ZERO_EXTEND || ISD::isZEXTLoad(N))
+  if (N->getOpcode() == ISD::ZERO_EXTEND || N->getOpcode() == ISD::ANY_EXTEND 
||
+  ISD::isZEXTLoad(N))
 return true;
   if (isExtendedBUILD_VECTOR(N, DAG, false))
 return true;
@@ -8795,13 +8796,14 @@ static SDValue SkipLoadExtensionForVMULL(LoadSDNode 
*LD, SelectionDAG& DAG) {
 }
 
 /// SkipExtensionForVMULL - For a node that is a SIGN_EXTEND, ZERO_EXTEND,
-/// extending load, or BUILD_VECTOR with extended elements, return the
-/// unextended value. The unextended vector should be 64 bits so that it can
+/// ANY_EXTEND, extending load, or BUILD_VECTOR with extended elements, return
+/// the unextended value. The unextended vector should be 64 bits so that it 
can
 /// be used as an operand to a VMULL instruction. If the original vector size
 /// before extension is less than 64 bits we add a an extension to resize
 /// the vector to 64 bits.
 static SDValue SkipExtensionForVMULL(SDNode *N, SelectionDAG &DAG) {
-  if (N->getOpcode() == ISD::SIGN_EXTEND || N->getOpcode() == ISD::ZERO_EXTEND)
+  if (N->getOpcode() == ISD::SIGN_EXTEND ||
+  N->getOpcode() == ISD::ZERO_EXTEND || N->getOpcode() == ISD::ANY_EXTEND)
 return AddRequiredExtensionForVMULL(N->getOperand(0), DAG,
 N->getOperand(0)->getValueType(0),
 N->getValueType(0),

diff  --git a/llvm/test/CodeGen/ARM/vmla.ll b/llvm/test/CodeGen/ARM/vmla.ll
index 14d425da2df4..43474efdf86b 100644
--- a/llvm/test/CodeGen/ARM/vmla.ll
+++ b/llvm/test/CodeGen/ARM/vmla.ll
@@ -156,9 +156,7 @@ define <2 x i64> @vmlalu32(<2 x i64> %A, <2 x i32> %B, <2 x 
i32> %C) nounwind {
 define <8 x i16> @vmlala8(<8 x i16> %A, <8 x i8> %B, <8 x i8> %C) nounwind {
 ; CHECK-LABEL: vmlala8:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u8 q8, d3
-; CHECK-NEXT:vmovl.u8 q9, d2
-; CHECK-NEXT:vmla.i16 q0, q9, q8
+; CHECK-NEXT:vmlal.u8 q0, d2, d3
 ; CHECK-NEXT:vbic.i16 q0, #0xff00
 ; CHECK-NEXT:bx lr
   %tmp4 = zext <8 x i8> %B to <8 x i16>
@@ -172,9 +170,7 @@ define <8 x i16> @vmlala8(<8 x i16> %A, <8 x i8> %B, <8 x 
i8> %C) nounwind {
 define <4 x i32> @vmlala16(<4 x i32> %A, <4 x i16> %B, <4 x i16> %C) nounwind {
 ; CHECK-LABEL: vmlala16:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:vmovl.u16 q8, d3
-; CHECK-NEXT:vmovl.u16 q9, d2
-; CHECK-NEXT:vmla.i32 q0, q9, q8
+; CHECK-NEXT:vmlal.u16 q0, d2, d3
 ; CHECK-NEXT:vmov.i32 q8, #0x
 ; CHECK-NEXT:vand q0, q0, q8
 ; CHECK-NEXT:bx lr
@@ -189,32 +185,10 @@ define <4 x i32> @vmlala16(<4 x i32> %A, <4 x i16> %B, <4 
x i16> %C) nounwind {
 define <2 x i64> @vmlala32(<2 x i64> %A, <2 x i32> %B, <2 x i32> %C) nounwind {
 ; CHECK-LABEL: vmlala32:
 ; CHECK:   @ %bb.0:
-; CHECK-NEXT:.save {r4, r5, r6, r7, r11, lr}
-; CHECK-NEXT:push {r4, r5, r6, r7, r11, lr}
-; CHECK-NEXT:vmovl.u32 q8, d3
-; CHECK-NEXT:vmovl.u32 q9, d2
-; CHECK-NEXT:vmov.32 r0, d16[0]
-; CHECK-NEXT:vmov.32 r1, d18[0]
-; CHECK-NEXT:vmov.32 r12, d16[1]
-; CHECK-NEXT:vmov.32 r3, d17[0]
-; CHECK-NEXT:vmov.32 r2, d19[0]
-; CHECK-NEXT:vmov.32 lr, d17[1]
-; CHECK-NEXT:vmov.32 r6, d19[1]
-; CHECK-NEXT:umull r7, r5, r1, r0
-; CHECK-NEXT:mla r1, r1, r12, r5
-; CHECK-NEXT:umull r5, r4, r2, r3
-; CHECK-NEXT:mla r2, r2, lr, r4
-; CHECK-NEXT:vmov.32 r4, d18[1]
-; CHECK-NEXT:vmov.i64 q9, #0x
-; CHECK-NEXT:mla r2, r6, r3, r2
-; CHECK-NEXT:vmov.32 d17[0], r5
-; 

[llvm-branch-commits] [llvm] a9b6440 - [AArch64] Handle any extend whilst lowering addw/addl/subw/subl

2021-01-06 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-06T10:35:23Z
New Revision: a9b6440eddde920920141d8ade9090273271a79b

URL: 
https://github.com/llvm/llvm-project/commit/a9b6440eddde920920141d8ade9090273271a79b
DIFF: 
https://github.com/llvm/llvm-project/commit/a9b6440eddde920920141d8ade9090273271a79b.diff

LOG: [AArch64] Handle any extend whilst lowering addw/addl/subw/subl

This adds an extra tablegen PatFrag, zanyext, which matches either any
extend or zext and uses that in the aarch64 backend to handle any
extends in addw/addl/subw/subl patterns.

Differential Revision: https://reviews.llvm.org/D93833
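
Roughly (our reading, with an illustrative example), zanyext lets a pattern
such as (add node:$LHS, (zanyext node:$RHS)) match whether the extend
survived as a zext or was relaxed to an any_extend, so IR along these lines
still selects uaddw instead of a separate ushll and add:

  define <8 x i16> @uaddw_sketch(<8 x i16> %acc, <8 x i8> %b) {
    %e = zext <8 x i8> %b to <8 x i16>   ; may become any_extend
    %s = add <8 x i16> %acc, %e
    %m = and <8 x i16> %s, <i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255, i16 255>
    ret <8 x i16> %m
  }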

Added: 


Modified: 
llvm/include/llvm/Target/TargetSelectionDAG.td
llvm/lib/Target/AArch64/AArch64InstrInfo.td
llvm/test/CodeGen/AArch64/arm64-neon-3vdiff.ll
llvm/test/CodeGen/AArch64/lowerMUL-newload.ll

Removed: 




diff  --git a/llvm/include/llvm/Target/TargetSelectionDAG.td 
b/llvm/include/llvm/Target/TargetSelectionDAG.td
index 0c6eef939ea4..a1e961aa9cb5 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -920,6 +920,10 @@ def not  : PatFrag<(ops node:$in), (xor node:$in, -1)>;
 def vnot : PatFrag<(ops node:$in), (xor node:$in, immAllOnesV)>;
 def ineg : PatFrag<(ops node:$in), (sub 0, node:$in)>;
 
+def zanyext : PatFrags<(ops node:$op),
+   [(zext node:$op),
+(anyext node:$op)]>;
+
 // null_frag - The null pattern operator is used in multiclass instantiations
 // which accept an SDPatternOperator for use in matching patterns for internal
 // definitions. When expanding a pattern, if the null fragment is referenced

diff  --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td 
b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 7e9f2fb95188..6209f51b1631 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -4765,18 +4765,18 @@ defm SSUBW   : SIMDWideThreeVectorBHS<0, 0b0011, 
"ssubw",
 defm UABAL   : SIMDLongThreeVectorTiedBHSabal<1, 0b0101, "uabal",
   AArch64uabd>;
 defm UADDL   : SIMDLongThreeVectorBHS<1, 0b, "uaddl",
- BinOpFrag<(add (zext node:$LHS), (zext node:$RHS))>>;
+ BinOpFrag<(add (zanyext node:$LHS), (zanyext node:$RHS))>>;
 defm UADDW   : SIMDWideThreeVectorBHS<1, 0b0001, "uaddw",
- BinOpFrag<(add node:$LHS, (zext node:$RHS))>>;
+ BinOpFrag<(add node:$LHS, (zanyext node:$RHS))>>;
 defm UMLAL   : SIMDLongThreeVectorTiedBHS<1, 0b1000, "umlal",
 TriOpFrag<(add node:$LHS, (int_aarch64_neon_umull node:$MHS, node:$RHS))>>;
 defm UMLSL   : SIMDLongThreeVectorTiedBHS<1, 0b1010, "umlsl",
 TriOpFrag<(sub node:$LHS, (int_aarch64_neon_umull node:$MHS, node:$RHS))>>;
 defm UMULL   : SIMDLongThreeVectorBHS<1, 0b1100, "umull", 
int_aarch64_neon_umull>;
 defm USUBL   : SIMDLongThreeVectorBHS<1, 0b0010, "usubl",
- BinOpFrag<(sub (zext node:$LHS), (zext node:$RHS))>>;
+ BinOpFrag<(sub (zanyext node:$LHS), (zanyext node:$RHS))>>;
 defm USUBW   : SIMDWideThreeVectorBHS<   1, 0b0011, "usubw",
- BinOpFrag<(sub node:$LHS, (zext node:$RHS))>>;
+ BinOpFrag<(sub node:$LHS, (zanyext node:$RHS))>>;
 
 // Additional patterns for SMULL and UMULL
multiclass Neon_mul_widen_patterns
define <8 x i16> @test_vaddl_a8(<8 x i8> %a, <8 x i8> %b) {
 ; CHECK-LABEL: test_vaddl_a8:
 ; CHECK:   // %bb.0: // %entry
-; CHECK-NEXT:ushll v0.8h, v0.8b, #0
-; CHECK-NEXT:ushll v1.8h, v1.8b, #0
-; CHECK-NEXT:add v0.8h, v0.8h, v1.8h
+; CHECK-NEXT:uaddl v0.8h, v0.8b, v1.8b
 ; CHECK-NEXT:bic v0.8h, #255, lsl #8
 ; CHECK-NEXT:ret
 entry:
@@ -119,9 +117,7 @@ entry:
 define <4 x i32> @test_vaddl_a16(<4 x i16> %a, <4 x i16> %b) {
 ; CHECK-LABEL: test_vaddl_a16:
 ; CHECK:   // %bb.0: // %entry
-; CHECK-NEXT:ushll v0.4s, v0.4h, #0
-; CHECK-NEXT:ushll v1.4s, v1.4h, #0
-; CHECK-NEXT:add v0.4s, v0.4s, v1.4s
+; CHECK-NEXT:uaddl v0.4s, v0.4h, v1.4h
 ; CHECK-NEXT:movi v1.2d, #0x00
 ; CHECK-NEXT:and v0.16b, v0.16b, v1.16b
 ; CHECK-NEXT:ret
@@ -136,9 +132,7 @@ entry:
 define <2 x i64> @test_vaddl_a32(<2 x i32> %a, <2 x i32> %b) {
 ; CHECK-LABEL: test_vaddl_a32:
 ; CHECK:   // %bb.0: // %entry
-; CHECK-NEXT:ushll v0.2d, v0.2s, #0
-; CHECK-NEXT:ushll v1.2d, v1.2s, #0
-; CHECK-NEXT:add v0.2d, v0.2d, v1.2d
+; CHECK-NEXT:uaddl v0.2d, v0.2s, v1.2s
 ; CHECK-NEXT:movi v1.2d, #0x00
 ; CHECK-NEXT:and v0.16b, v0.16b, v1.16b
 ; CHECK-NEXT:ret
@@ -237,9 +231,7 @@ entry:
 define <8 x i16> @test_vaddl_high_a8(<16 x i8> %a, <16 x i8> %b) {
 ; CHECK-LABEL: test_vaddl_high_a8:
 ; CHECK:   // %bb.0: // %entry
-; CHECK-NEXT:ushll2 v0.8h, v0.16b, #0
-; CHECK-NEXT:ushll2 v1.8h, v1.16b, #0
-; CHECK-NEXT:add v0.8h, v0.8h, v1.8h
+; 

[llvm-branch-commits] [llvm] 78d8a82 - [AArch64] Handle any extend whilst lowering mull

2021-01-06 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-06T10:08:43Z
New Revision: 78d8a821e23e42d13dcbb3467747e480fb889b8a

URL: 
https://github.com/llvm/llvm-project/commit/78d8a821e23e42d13dcbb3467747e480fb889b8a
DIFF: 
https://github.com/llvm/llvm-project/commit/78d8a821e23e42d13dcbb3467747e480fb889b8a.diff

LOG: [AArch64] Handle any extend whilst lowering mull

Demanded bits may turn a sext or zext into an anyext if the top bits are
not needed. This currently prevents the lowering to instructions like
mull, addl and addw. This patch fixes the mull generation by keeping it
simple and treating them like zextends.

Differential Revision: https://reviews.llvm.org/D93832
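
A minimal sketch (ours), mirroring the amull_v4i16_v4i32 test below: the
masked use lets demanded-bits turn the zexts into any_extends, and the
combine now still forms umull for the multiply.

  define <4 x i32> @amull_sketch(<4 x i16> %a, <4 x i16> %b) {
    %ea = zext <4 x i16> %a to <4 x i32>   ; may become any_extend
    %eb = zext <4 x i16> %b to <4 x i32>   ; may become any_extend
    %p = mul <4 x i32> %ea, %eb
    %m = and <4 x i32> %p, <i32 65535, i32 65535, i32 65535, i32 65535>
    ret <4 x i32> %m
  }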

Added: 


Modified: 
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
llvm/test/CodeGen/AArch64/aarch64-smull.ll
llvm/test/CodeGen/AArch64/lowerMUL-newload.ll

Removed: 




diff  --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp 
b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 2b9dc84a06cc..41dc285a368d 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -3347,7 +3347,8 @@ static bool isExtendedBUILD_VECTOR(SDNode *N, SelectionDAG &DAG,
 }
 
 static SDValue skipExtensionForVectorMULL(SDNode *N, SelectionDAG &DAG) {
-  if (N->getOpcode() == ISD::SIGN_EXTEND || N->getOpcode() == ISD::ZERO_EXTEND)
+  if (N->getOpcode() == ISD::SIGN_EXTEND ||
+  N->getOpcode() == ISD::ZERO_EXTEND || N->getOpcode() == ISD::ANY_EXTEND)
 return addRequiredExtensionForVectorMULL(N->getOperand(0), DAG,
  N->getOperand(0)->getValueType(0),
  N->getValueType(0),
@@ -3377,6 +3378,7 @@ static bool isSignExtended(SDNode *N, SelectionDAG &DAG) {
 
 static bool isZeroExtended(SDNode *N, SelectionDAG &DAG) {
   return N->getOpcode() == ISD::ZERO_EXTEND ||
+ N->getOpcode() == ISD::ANY_EXTEND ||
  isExtendedBUILD_VECTOR(N, DAG, false);
 }
 

diff  --git a/llvm/test/CodeGen/AArch64/aarch64-smull.ll 
b/llvm/test/CodeGen/AArch64/aarch64-smull.ll
index 17a21e566ec4..0a692192ec8b 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-smull.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-smull.ll
@@ -96,9 +96,7 @@ define <8 x i16> @amull_v8i8_v8i16(<8 x i8>* %A, <8 x i8>* 
%B) nounwind {
 ; CHECK:   // %bb.0:
 ; CHECK-NEXT:ldr d0, [x0]
 ; CHECK-NEXT:ldr d1, [x1]
-; CHECK-NEXT:ushll v0.8h, v0.8b, #0
-; CHECK-NEXT:ushll v1.8h, v1.8b, #0
-; CHECK-NEXT:mul v0.8h, v0.8h, v1.8h
+; CHECK-NEXT:umull v0.8h, v0.8b, v1.8b
 ; CHECK-NEXT:bic v0.8h, #255, lsl #8
 ; CHECK-NEXT:ret
   %tmp1 = load <8 x i8>, <8 x i8>* %A
@@ -115,9 +113,7 @@ define <4 x i32> @amull_v4i16_v4i32(<4 x i16>* %A, <4 x 
i16>* %B) nounwind {
 ; CHECK:   // %bb.0:
 ; CHECK-NEXT:ldr d0, [x0]
 ; CHECK-NEXT:ldr d1, [x1]
-; CHECK-NEXT:ushll v0.4s, v0.4h, #0
-; CHECK-NEXT:ushll v1.4s, v1.4h, #0
-; CHECK-NEXT:mul v0.4s, v0.4s, v1.4s
+; CHECK-NEXT:umull v0.4s, v0.4h, v1.4h
 ; CHECK-NEXT:movi v1.2d, #0x00
 ; CHECK-NEXT:and v0.16b, v0.16b, v1.16b
 ; CHECK-NEXT:ret
@@ -135,16 +131,7 @@ define <2 x i64> @amull_v2i32_v2i64(<2 x i32>* %A, <2 x 
i32>* %B) nounwind {
 ; CHECK:   // %bb.0:
 ; CHECK-NEXT:ldr d0, [x0]
 ; CHECK-NEXT:ldr d1, [x1]
-; CHECK-NEXT:ushll v0.2d, v0.2s, #0
-; CHECK-NEXT:ushll v1.2d, v1.2s, #0
-; CHECK-NEXT:fmov x10, d1
-; CHECK-NEXT:fmov x11, d0
-; CHECK-NEXT:mov x8, v1.d[1]
-; CHECK-NEXT:mov x9, v0.d[1]
-; CHECK-NEXT:mul x10, x11, x10
-; CHECK-NEXT:mul x8, x9, x8
-; CHECK-NEXT:fmov d0, x10
-; CHECK-NEXT:mov v0.d[1], x8
+; CHECK-NEXT:umull v0.2d, v0.2s, v1.2s
 ; CHECK-NEXT:movi v1.2d, #0x00
 ; CHECK-NEXT:and v0.16b, v0.16b, v1.16b
 ; CHECK-NEXT:ret
@@ -268,12 +255,10 @@ define <2 x i64> @umlal_v2i32_v2i64(<2 x i64>* %A, <2 x 
i32>* %B, <2 x i32>* %C)
 define <8 x i16> @amlal_v8i8_v8i16(<8 x i16>* %A, <8 x i8>* %B, <8 x i8>* %C) 
nounwind {
 ; CHECK-LABEL: amlal_v8i8_v8i16:
 ; CHECK:   // %bb.0:
+; CHECK-NEXT:ldr q0, [x0]
 ; CHECK-NEXT:ldr d1, [x1]
 ; CHECK-NEXT:ldr d2, [x2]
-; CHECK-NEXT:ldr q0, [x0]
-; CHECK-NEXT:ushll v1.8h, v1.8b, #0
-; CHECK-NEXT:ushll v2.8h, v2.8b, #0
-; CHECK-NEXT:mla v0.8h, v1.8h, v2.8h
+; CHECK-NEXT:umlal v0.8h, v1.8b, v2.8b
 ; CHECK-NEXT:bic v0.8h, #255, lsl #8
 ; CHECK-NEXT:ret
   %tmp1 = load <8 x i16>, <8 x i16>* %A
@@ -290,14 +275,12 @@ define <8 x i16> @amlal_v8i8_v8i16(<8 x i16>* %A, <8 x 
i8>* %B, <8 x i8>* %C) no
 define <4 x i32> @amlal_v4i16_v4i32(<4 x i32>* %A, <4 x i16>* %B, <4 x i16>* 
%C) nounwind {
 ; CHECK-LABEL: amlal_v4i16_v4i32:
 ; CHECK:   // %bb.0:
-; CHECK-NEXT:ldr d0, [x1]
-; CHECK-NEXT:ldr d1, [x2]
-; CHECK-NEXT:ldr q2, [x0]
-; CHECK-NEXT:ushll v0.4s, v0.4h, #0
-; CHECK-NEXT:ushll v1.4s, v1.4h, 

[llvm-branch-commits] [llvm] 0c59a4d - [ARM][AArch64] Some extra test to show anyextend lowering. NFC

2021-01-05 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-05T17:34:23Z
New Revision: 0c59a4da59a42f04ba932fdab806fc4473d4e0b5

URL: 
https://github.com/llvm/llvm-project/commit/0c59a4da59a42f04ba932fdab806fc4473d4e0b5
DIFF: 
https://github.com/llvm/llvm-project/commit/0c59a4da59a42f04ba932fdab806fc4473d4e0b5.diff

LOG: [ARM][AArch64] Some extra test to show anyextend lowering. NFC

Added: 
llvm/test/CodeGen/AArch64/lowerMUL-newload.ll

Modified: 
llvm/test/CodeGen/ARM/lowerMUL-newload.ll

Removed: 




diff  --git a/llvm/test/CodeGen/AArch64/lowerMUL-newload.ll 
b/llvm/test/CodeGen/AArch64/lowerMUL-newload.ll
new file mode 100644
index ..530aeed7f34c
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/lowerMUL-newload.ll
@@ -0,0 +1,439 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=aarch64-none-eabi | FileCheck %s
+
+define <4 x i16> @mlai16_trunc(<4 x i16> %vec0, <4 x i16> %vec1, <4 x i16> 
%vec2) {
+; CHECK-LABEL: mlai16_trunc:
+; CHECK:   // %bb.0: // %entry
+; CHECK-NEXT:smull v0.4s, v1.4h, v0.4h
+; CHECK-NEXT:saddw v0.4s, v0.4s, v2.4h
+; CHECK-NEXT:xtn v0.4h, v0.4s
+; CHECK-NEXT:ret
+entry:
+  %v0 = sext <4 x i16> %vec0 to <4 x i32>
+  %v1 = sext <4 x i16> %vec1 to <4 x i32>
+  %v2 = sext <4 x i16> %vec2 to <4 x i32>
+  %v3 = mul <4 x i32> %v1, %v0
+  %v4 = add <4 x i32> %v3, %v2
+  %v5 = trunc <4 x i32> %v4 to <4 x i16>
+  ret <4 x i16> %v5
+}
+
+define <4 x i32> @mlai16_and(<4 x i16> %vec0, <4 x i16> %vec1, <4 x i16> 
%vec2) {
+; CHECK-LABEL: mlai16_and:
+; CHECK:   // %bb.0: // %entry
+; CHECK-NEXT:ushll v0.4s, v0.4h, #0
+; CHECK-NEXT:ushll v1.4s, v1.4h, #0
+; CHECK-NEXT:ushll v2.4s, v2.4h, #0
+; CHECK-NEXT:mla v2.4s, v1.4s, v0.4s
+; CHECK-NEXT:movi v0.2d, #0x00
+; CHECK-NEXT:and v0.16b, v2.16b, v0.16b
+; CHECK-NEXT:ret
+entry:
+  %v0 = sext <4 x i16> %vec0 to <4 x i32>
+  %v1 = sext <4 x i16> %vec1 to <4 x i32>
+  %v2 = sext <4 x i16> %vec2 to <4 x i32>
+  %v3 = mul <4 x i32> %v1, %v0
+  %v4 = add <4 x i32> %v3, %v2
+  %v5 = and <4 x i32> %v4, <i32 65535, i32 65535, i32 65535, i32 65535>
+  ret <4 x i32> %v5
+}
+
+define void @mlai16_loadstore(i16* %a, i16* %b, i16* %c) {
+; CHECK-LABEL: mlai16_loadstore:
+; CHECK:   // %bb.0: // %entry
+; CHECK-NEXT:ldr d0, [x0, #16]
+; CHECK-NEXT:ldr d1, [x1, #16]
+; CHECK-NEXT:ldr d2, [x2, #16]
+; CHECK-NEXT:smull v0.4s, v1.4h, v0.4h
+; CHECK-NEXT:saddw v0.4s, v0.4s, v2.4h
+; CHECK-NEXT:xtn v0.4h, v0.4s
+; CHECK-NEXT:str d0, [x0, #16]
+; CHECK-NEXT:ret
+entry:
+  %scevgep0 = getelementptr i16, i16* %a, i32 8
+  %vector_ptr0 = bitcast i16* %scevgep0 to <4 x i16>*
+  %vec0 = load <4 x i16>, <4 x i16>* %vector_ptr0, align 8
+  %v0 = sext <4 x i16> %vec0 to <4 x i32>
+  %scevgep1 = getelementptr i16, i16* %b, i32 8
+  %vector_ptr1 = bitcast i16* %scevgep1 to <4 x i16>*
+  %vec1 = load <4 x i16>, <4 x i16>* %vector_ptr1, align 8
+  %v1 = sext <4 x i16> %vec1 to <4 x i32>
+  %scevgep2 = getelementptr i16, i16* %c, i32 8
+  %vector_ptr2 = bitcast i16* %scevgep2 to <4 x i16>*
+  %vec2 = load <4 x i16>, <4 x i16>* %vector_ptr2, align 8
+  %v2 = sext <4 x i16> %vec2 to <4 x i32>
+  %v3 = mul <4 x i32> %v1, %v0
+  %v4 = add <4 x i32> %v3, %v2
+  %v5 = trunc <4 x i32> %v4 to <4 x i16>
+  %scevgep3 = getelementptr i16, i16* %a, i32 8
+  %vector_ptr3 = bitcast i16* %scevgep3 to <4 x i16>*
+  store <4 x i16> %v5, <4 x i16>* %vector_ptr3, align 8
+  ret void
+}
+
+define <4 x i16> @addmuli16_trunc(<4 x i16> %vec0, <4 x i16> %vec1, <4 x i16> 
%vec2) {
+; CHECK-LABEL: addmuli16_trunc:
+; CHECK:   // %bb.0: // %entry
+; CHECK-NEXT:smull v1.4s, v1.4h, v2.4h
+; CHECK-NEXT:smlal v1.4s, v0.4h, v2.4h
+; CHECK-NEXT:xtn v0.4h, v1.4s
+; CHECK-NEXT:ret
+entry:
+  %v0 = sext <4 x i16> %vec0 to <4 x i32>
+  %v1 = sext <4 x i16> %vec1 to <4 x i32>
+  %v2 = sext <4 x i16> %vec2 to <4 x i32>
+  %v3 = add <4 x i32> %v1, %v0
+  %v4 = mul <4 x i32> %v3, %v2
+  %v5 = trunc <4 x i32> %v4 to <4 x i16>
+  ret <4 x i16> %v5
+}
+
+define <4 x i32> @addmuli16_and(<4 x i16> %vec0, <4 x i16> %vec1, <4 x i16> 
%vec2) {
+; CHECK-LABEL: addmuli16_and:
+; CHECK:   // %bb.0: // %entry
+; CHECK-NEXT:ushll v0.4s, v0.4h, #0
+; CHECK-NEXT:ushll v1.4s, v1.4h, #0
+; CHECK-NEXT:ushll v2.4s, v2.4h, #0
+; CHECK-NEXT:add v0.4s, v1.4s, v0.4s
+; CHECK-NEXT:mul v0.4s, v0.4s, v2.4s
+; CHECK-NEXT:movi v1.2d, #0x00
+; CHECK-NEXT:and v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:ret
+entry:
+  %v0 = sext <4 x i16> %vec0 to <4 x i32>
+  %v1 = sext <4 x i16> %vec1 to <4 x i32>
+  %v2 = sext <4 x i16> %vec2 to <4 x i32>
+  %v3 = add <4 x i32> %v1, %v0
+  %v4 = mul <4 x i32> %v3, %v2
+  %v5 = and <4 x i32> %v4, <i32 65535, i32 65535, i32 65535, i32 65535>
+  ret <4 x i32> %v5
+}
+
+define void @addmuli16_loadstore(i16* %a, i16* %b, i16* %c) {
+; CHECK-LABEL: addmuli16_loadstore:
+; CHECK:   // %bb.0: // %entry
+; 

[llvm-branch-commits] [llvm] 901cc9b - [ARM] Extend lowering for i64 reductions

2021-01-04 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-04T12:44:43Z
New Revision: 901cc9b6f30f120f2fbdc01f9eec3708c512186b

URL: 
https://github.com/llvm/llvm-project/commit/901cc9b6f30f120f2fbdc01f9eec3708c512186b
DIFF: 
https://github.com/llvm/llvm-project/commit/901cc9b6f30f120f2fbdc01f9eec3708c512186b.diff

LOG: [ARM] Extend lowering for i64 reductions

The lowering of a <4 x i16> or <4 x i8> vecreduce.add into an i64 would
previously be expanded, due to the i64 type not being legal. This patch
adjusts our reduction matchers so that they produce a VADDLV(sext A to
v4i32) instead.

Differential Revision: https://reviews.llvm.org/D93622
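
For example (our sketch; the intrinsic spelling assumes the current
llvm.vector.reduce naming), a 64-bit reduction of a small vector such as:

  declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64>)

  define i64 @addlv_sketch(<4 x i16> %x) {
    %e = sext <4 x i16> %x to <4 x i64>
    %r = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %e)
    ret i64 %r
  }

is now matched as VADDLV(sext <4 x i16> to v4i32): widening the elements to
i32 first cannot change the 64-bit sum of four values, so the generic
expansion is avoided.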

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-addpred.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-mla.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-mlapred.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 6eb1bdffdac4..6a8355f0c3e8 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -14959,12 +14970,23 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, SelectionDAG &DAG,
   //   VADDLV u/s 32
   //   VMLALV u/s 16/32
 
+  // If the input vector is smaller than legal (v4i8/v4i16 for example) we can
+  // extend it and use v4i32 instead.
+  auto ExtendIfNeeded = [&](SDValue A, unsigned ExtendCode) {
+EVT AVT = A.getValueType();
+if (!AVT.is128BitVector())
+  A = DAG.getNode(ExtendCode, dl,
+  AVT.changeVectorElementType(MVT::getIntegerVT(
+  128 / AVT.getVectorMinNumElements())),
+  A);
+return A;
+  };
   auto IsVADDV = [&](MVT RetTy, unsigned ExtendCode, ArrayRef<MVT> ExtTypes) {
 if (ResVT != RetTy || N0->getOpcode() != ExtendCode)
   return SDValue();
 SDValue A = N0->getOperand(0);
 if (llvm::any_of(ExtTypes, [&A](MVT Ty) { return A.getValueType() == Ty; }))
-  return A;
+  return ExtendIfNeeded(A, ExtendCode);
 return SDValue();
   };
   auto IsPredVADDV = [&](MVT RetTy, unsigned ExtendCode,
@@ -14978,7 +14989,7 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, 
SelectionDAG ,
   return SDValue();
 SDValue A = Ext->getOperand(0);
 if (llvm::any_of(ExtTypes, [](MVT Ty) { return A.getValueType() == Ty; 
}))
-  return A;
+  return ExtendIfNeeded(A, ExtendCode);
 return SDValue();
   };
   auto IsVMLAV = [&](MVT RetTy, unsigned ExtendCode, ArrayRef ExtTypes,
@@ -15007,8 +15018,12 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, 
SelectionDAG ,
 A = ExtA->getOperand(0);
 B = ExtB->getOperand(0);
 if (A.getValueType() == B.getValueType() &&
-llvm::any_of(ExtTypes, [](MVT Ty) { return A.getValueType() == Ty; 
}))
+llvm::any_of(ExtTypes,
+ [](MVT Ty) { return A.getValueType() == Ty; })) {
+  A = ExtendIfNeeded(A, ExtendCode);
+  B = ExtendIfNeeded(B, ExtendCode);
   return true;
+}
 return false;
   };
   auto IsPredVMLAV = [&](MVT RetTy, unsigned ExtendCode, ArrayRef 
ExtTypes,
@@ -15037,8 +15052,12 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, 
SelectionDAG ,
 A = ExtA->getOperand(0);
 B = ExtB->getOperand(0);
 if (A.getValueType() == B.getValueType() &&
-llvm::any_of(ExtTypes, [](MVT Ty) { return A.getValueType() == Ty; 
}))
+llvm::any_of(ExtTypes,
+ [](MVT Ty) { return A.getValueType() == Ty; })) {
+  A = ExtendIfNeeded(A, ExtendCode);
+  B = ExtendIfNeeded(B, ExtendCode);
   return true;
+}
 return false;
   };
   auto Create64bitNode = [&](unsigned Opcode, ArrayRef Ops) {
@@ -15051,9 +15070,11 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, 
SelectionDAG ,
 return DAG.getNode(ARMISD::VADDVs, dl, ResVT, A);
   if (SDValue A = IsVADDV(MVT::i32, ISD::ZERO_EXTEND, {MVT::v8i16, 
MVT::v16i8}))
 return DAG.getNode(ARMISD::VADDVu, dl, ResVT, A);
-  if (SDValue A = IsVADDV(MVT::i64, ISD::SIGN_EXTEND, {MVT::v4i32}))
+  if (SDValue A = IsVADDV(MVT::i64, ISD::SIGN_EXTEND,
+  {MVT::v4i8, MVT::v4i16, MVT::v4i32}))
 return Create64bitNode(ARMISD::VADDLVs, {A});
-  if (SDValue A = IsVADDV(MVT::i64, ISD::ZERO_EXTEND, {MVT::v4i32}))
+  if (SDValue A = IsVADDV(MVT::i64, ISD::ZERO_EXTEND,
+  {MVT::v4i8, MVT::v4i16, MVT::v4i32}))
 return Create64bitNode(ARMISD::VADDLVu, {A});
   if (SDValue A = IsVADDV(MVT::i16, ISD::SIGN_EXTEND, {MVT::v16i8}))
 return DAG.getNode(ISD::TRUNCATE, dl, ResVT,
@@ -15067,9 +15088,11 @@ static SDValue PerformVECREDUCE_ADDCombine(SDNode *N, 
SelectionDAG ,
 return DAG.getNode(ARMISD::VADDVps, dl, ResVT, A, Mask);
   if (SDValue A = IsPredVADDV(MVT::i32, ISD::ZERO_EXTEND, {MVT::v8i16, 
MVT::v16i8}, Mask))
  

[llvm-branch-commits] [llvm] 6c89f6f - [AArch64] Attempt to fix Mac tests with a more specific triple. NFC

2021-01-04 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2021-01-04T11:29:18Z
New Revision: 6c89f6fae4913eba07093fe7c268e828f801c78b

URL: 
https://github.com/llvm/llvm-project/commit/6c89f6fae4913eba07093fe7c268e828f801c78b
DIFF: 
https://github.com/llvm/llvm-project/commit/6c89f6fae4913eba07093fe7c268e828f801c78b.diff

LOG: [AArch64] Attempt to fix Mac tests with a more specific triple. NFC

Added: 


Modified: 
llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s

Removed: 




diff  --git a/llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s 
b/llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s
index a71c99400c4e..f111c4101ab0 100644
--- a/llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s
+++ b/llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s
@@ -1,4 +1,4 @@
-# RUN: llvm-mca -march=aarch64 -mcpu=cortex-a57 -iterations=1 -timeline < %s | 
FileCheck %s
+# RUN: llvm-mca -mtriple=aarch64-none-eabi -mcpu=cortex-a57 -iterations=1 
-timeline < %s | FileCheck %s
 
 # CHECK: [0] Code Region
 # CHECK: Instructions:  2



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] 685c8b5 - [AARCH64] Improve accumulator forwarding for Cortex-A57 model

2021-01-04 Thread David Green via llvm-branch-commits

Author: Usman Nadeem
Date: 2021-01-04T10:58:43Z
New Revision: 685c8b537af3138cff24ec6060a86140b8963a1e

URL: 
https://github.com/llvm/llvm-project/commit/685c8b537af3138cff24ec6060a86140b8963a1e
DIFF: 
https://github.com/llvm/llvm-project/commit/685c8b537af3138cff24ec6060a86140b8963a1e.diff

LOG: [AARCH64] Improve accumulator forwarding for Cortex-A57 model

The old CPU model only had MLA->MLA forwarding. I added some missing
MUL->MLA read advances and a missing absolute diff accumulator read
advance according to the Cortex A57 Software Optimization Guide.

The patch improves performance in EEMBC rgbyiqv2 by about 6%-7% and
spec2006/milc by 8% (repeated runs on multiple devices), and causes no
significant regressions (none in SPEC).

Differential Revision: https://reviews.llvm.org/D92296
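
For a sense of how the scheduler consumes these numbers, here is a
sketch of the usual read-advance arithmetic (a simplification under
stated assumptions, not scheduler code): a ReadAdvance of N lets the
consumer pick up the producing operand roughly N cycles early.

  #include <cstdio>

  // Effective producer-to-consumer latency seen through a ReadAdvance:
  // approximately max(Latency - Advance, 0).
  unsigned effectiveLatency(unsigned producerLatency, unsigned readAdvance) {
    return producerLatency > readAdvance ? producerLatency - readAdvance : 0;
  }

  int main() {
    // MUL (latency 5) feeding the accumulator of an MLA via A57ReadIVMA4:
    printf("%u\n", effectiveLatency(5, 4)); // the MLA waits about 1 cycle
  }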

Added: 
llvm/test/tools/llvm-mca/AArch64/Cortex/forwarding-A57.s

Modified: 
llvm/lib/Target/AArch64/AArch64SchedA57.td
llvm/lib/Target/AArch64/AArch64SchedA57WriteRes.td

Removed: 




diff  --git a/llvm/lib/Target/AArch64/AArch64SchedA57.td 
b/llvm/lib/Target/AArch64/AArch64SchedA57.td
index 7c40da05c305..aa5bec8088e4 100644
--- a/llvm/lib/Target/AArch64/AArch64SchedA57.td
+++ b/llvm/lib/Target/AArch64/AArch64SchedA57.td
@@ -93,7 +93,7 @@ def : SchedAlias;
 def : SchedAlias;
 def : SchedAlias;
 def : SchedAlias;
-def : SchedAlias;
+def : WriteRes { let Latency = 5;}
 def : SchedAlias;
 def : SchedAlias;
 def : SchedAlias;
@@ -350,12 +350,16 @@ def : InstRW<[A57Write_8cyc_8S, WriteAdr],  
(instregex "ST4Fourv(2d)_POST$")
 //   D form - v8i8_v8i16, v4i16_v4i32, v2i32_v2i64
 //   Q form - v16i8_v8i16, v8i16_v4i32, v4i32_v2i64
 
+// Cortex A57 Software Optimization Guide Sec 3.14
+// Advance for absolute diff accum, pairwise add and accumulate, shift accumulate
+def A57ReadIVA3 : SchedReadAdvance<3, [A57Write_4cyc_1X_NonMul_Forward, 
A57Write_5cyc_2X_NonMul_Forward]>;
+
 // ASIMD absolute diff accum, D-form
-def : InstRW<[A57Write_4cyc_1X], (instregex "^[SU]ABA(v8i8|v4i16|v2i32)$")>;
+def : InstRW<[A57Write_4cyc_1X_NonMul_Forward, A57ReadIVA3], (instregex 
"^[SU]ABA(v8i8|v4i16|v2i32)$")>;
 // ASIMD absolute diff accum, Q-form
-def : InstRW<[A57Write_5cyc_2X], (instregex "^[SU]ABA(v16i8|v8i16|v4i32)$")>;
+def : InstRW<[A57Write_5cyc_2X_NonMul_Forward, A57ReadIVA3], (instregex 
"^[SU]ABA(v16i8|v8i16|v4i32)$")>;
 // ASIMD absolute diff accum long
-def : InstRW<[A57Write_4cyc_1X], (instregex "^[SU]ABAL")>;
+def : InstRW<[A57Write_4cyc_1X_NonMul_Forward, A57ReadIVA3], (instregex 
"^[SU]ABAL")>;
 
 // ASIMD arith, reduce, 4H/4S
 def : InstRW<[A57Write_4cyc_1X], (instregex 
"^[SU]?ADDL?V(v8i8|v4i16|v2i32)v$")>;
@@ -372,32 +376,41 @@ def : InstRW<[A57Write_7cyc_1V_1X], (instregex 
"^[SU](MIN|MAX)V(v8i8|v8i16)v$")>
 def : InstRW<[A57Write_8cyc_2X], (instregex "^[SU](MIN|MAX)Vv16i8v$")>;
 
 // ASIMD multiply, D-form
-def : InstRW<[A57Write_5cyc_1W], (instregex 
"^(P?MUL|SQR?DMULH)(v8i8|v4i16|v2i32|v1i8|v1i16|v1i32|v1i64)(_indexed)?$")>;
+// MUL
+def : InstRW<[A57Write_5cyc_1W_Mul_Forward], (instregex 
"^MUL(v8i8|v4i16|v2i32|v1i8|v1i16|v1i32|v1i64)(_indexed)?$")>;
+// PMUL, SQDMULH, SQRDMULH
+def : InstRW<[A57Write_5cyc_1W], (instregex 
"^(PMUL|SQR?DMULH)(v8i8|v4i16|v2i32|v1i8|v1i16|v1i32|v1i64)(_indexed)?$")>;
+
 // ASIMD multiply, Q-form
-def : InstRW<[A57Write_6cyc_2W], (instregex 
"^(P?MUL|SQR?DMULH)(v16i8|v8i16|v4i32)(_indexed)?$")>;
+// MUL
+def : InstRW<[A57Write_6cyc_2W_Mul_Forward], (instregex 
"^MUL(v16i8|v8i16|v4i32)(_indexed)?$")>;
+// PMUL, SQDMULH, SQRDMULH
+def : InstRW<[A57Write_6cyc_2W], (instregex 
"^(PMUL|SQR?DMULH)(v16i8|v8i16|v4i32)(_indexed)?$")>;
+
+// Cortex A57 Software Optimization Guide Sec 3.14
+def A57ReadIVMA4   : SchedReadAdvance<4 , [A57Write_5cyc_1W_Mul_Forward, 
A57Write_6cyc_2W_Mul_Forward]>;
+def A57ReadIVMA3   : SchedReadAdvance<3 , [A57Write_5cyc_1W_Mul_Forward, 
A57Write_6cyc_2W_Mul_Forward]>;
 
 // ASIMD multiply accumulate, D-form
-def : InstRW<[A57Write_5cyc_1W], (instregex 
"^ML[AS](v8i8|v4i16|v2i32)(_indexed)?$")>;
+def : InstRW<[A57Write_5cyc_1W_Mul_Forward, A57ReadIVMA4], (instregex 
"^ML[AS](v8i8|v4i16|v2i32)(_indexed)?$")>;
 // ASIMD multiply accumulate, Q-form
-def : InstRW<[A57Write_6cyc_2W], (instregex 
"^ML[AS](v16i8|v8i16|v4i32)(_indexed)?$")>;
+def : InstRW<[A57Write_6cyc_2W_Mul_Forward, A57ReadIVMA4], (instregex 
"^ML[AS](v16i8|v8i16|v4i32)(_indexed)?$")>;
 
 // ASIMD multiply accumulate long
 // ASIMD multiply accumulate saturating long
-def A57WriteIVMA   : SchedWriteRes<[A57UnitW]> { let Latency = 5;  }
-def A57ReadIVMA4   : SchedReadAdvance<4, [A57WriteIVMA]>;
-def : InstRW<[A57WriteIVMA, A57ReadIVMA4], (instregex "^(S|U|SQD)ML[AS]L")>;
+def : InstRW<[A57Write_5cyc_1W_Mul_Forward, A57ReadIVMA4], (instregex 
"^(S|U)ML[AS]L")>;
+def : InstRW<[A57Write_5cyc_1W_Mul_Forward, A57ReadIVMA3], (instregex 
"^SQDML[AS]L")>;
 
 // ASIMD multiply long
-def : InstRW<[A57Write_5cyc_1W], 

[llvm-branch-commits] [llvm] a9f14cd - [ARM] Add bank conflict hazarding

2020-12-23 Thread David Green via llvm-branch-commits

Author: David Penry
Date: 2020-12-23T14:00:59Z
New Revision: a9f14cdc6203c05425f8b17228ff368f7fd9ae29

URL: 
https://github.com/llvm/llvm-project/commit/a9f14cdc6203c05425f8b17228ff368f7fd9ae29
DIFF: 
https://github.com/llvm/llvm-project/commit/a9f14cdc6203c05425f8b17228ff368f7fd9ae29.diff

LOG: [ARM] Add bank conflict hazarding

Adds ARMBankConflictHazardRecognizer. This hazard recognizer
looks for a few situations where the same base pointer is used and
then checks whether the offsets lead to a bank conflict. Two
parameters are also added to permit overriding of the target
assumptions:

arm-data-bank-mask= - Mask of bits which are to be checked for
conflicts.  If all these bits are equal in the offsets, there is a
conflict.
arm-assume-itcm-bankconflict= - Assume that there will be bank
conflicts on any loads to a constant pool.

This hazard recognizer is enabled for Cortex-M7, where the Technical
Reference Manual states that there are two DTCM banks banked using bit
2 and one ITCM bank.

Differential Revision: https://reviews.llvm.org/D93054
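
A minimal sketch of the conflict test described above (the function
name is illustrative): two accesses off the same base conflict when the
offset bits selected by the mask are equal, and per the commit message
Cortex-M7's DTCM is banked on bit 2, i.e. mask 0x4.

  #include <cstdint>
  #include <cstdio>

  // Offsets conflict when the masked bits match (same bank).
  bool dtcmBankConflict(uint32_t offA, uint32_t offB, uint32_t mask) {
    return (offA & mask) == (offB & mask);
  }

  int main() {
    printf("%d\n", dtcmBankConflict(0, 8, 0x4)); // 1: both hit bank 0
    printf("%d\n", dtcmBankConflict(0, 4, 0x4)); // 0: different banks
  }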

Added: 
llvm/test/CodeGen/Thumb2/schedm7-hazard.ll

Modified: 
llvm/lib/Target/ARM/ARM.td
llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
llvm/lib/Target/ARM/ARMHazardRecognizer.cpp
llvm/lib/Target/ARM/ARMHazardRecognizer.h
llvm/lib/Target/ARM/ARMSubtarget.cpp
llvm/lib/Target/ARM/ARMSubtarget.h

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARM.td b/llvm/lib/Target/ARM/ARM.td
index 8111346c74f6..5d626e7d8e5a 100644
--- a/llvm/lib/Target/ARM/ARM.td
+++ b/llvm/lib/Target/ARM/ARM.td
@@ -660,7 +660,8 @@ def ProcR52 : SubtargetFeature<"r52", "ARMProcFamily", 
"CortexR52",
 
 def ProcM3  : SubtargetFeature<"m3", "ARMProcFamily", "CortexM3",
"Cortex-M3 ARM processors", []>;
-
+def ProcM7  : SubtargetFeature<"m7", "ARMProcFamily", "CortexM7",
+   "Cortex-M7 ARM processors", []>;
 
 
//===--===//
 // ARM Helper classes.
@@ -1191,6 +1192,7 @@ def : ProcessorModel<"cortex-m4", CortexM4Model,
[ARMv7em,
  
FeatureHasNoBranchPredictor]>;
 
 def : ProcessorModel<"cortex-m7", CortexM7Model,[ARMv7em,
+ ProcM7,
  FeatureFPARMv8_D16,
  FeatureUseMISched]>;
 

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
index def631276950..563f2d38edf0 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
@@ -134,6 +134,31 @@ ARMBaseInstrInfo::CreateTargetHazardRecognizer(const 
TargetSubtargetInfo *STI,
   return TargetInstrInfo::CreateTargetHazardRecognizer(STI, DAG);
 }
 
+// Called during:
+// - pre-RA scheduling
+// - post-RA scheduling when FeatureUseMISched is set
+ScheduleHazardRecognizer *ARMBaseInstrInfo::CreateTargetMIHazardRecognizer(
+const InstrItineraryData *II, const ScheduleDAGMI *DAG) const {
+  MultiHazardRecognizer *MHR = new MultiHazardRecognizer();
+
+  // We would like to restrict this hazard recognizer to only
+  // post-RA scheduling; we can tell that we're post-RA because we don't
+  // track VRegLiveness.
+  // Cortex-M7: TRM indicates that there is a single ITCM bank and two DTCM
+  //banks banked on bit 2.  Assume that TCMs are in use.
+  if (Subtarget.isCortexM7() && !DAG->hasVRegLiveness())
+MHR->AddHazardRecognizer(
+std::make_unique(DAG, 0x4, true));
+
+  // Not inserting ARMHazardRecognizerFPMLx because that would change
+  // legacy behavior
+
+  auto BHR = TargetInstrInfo::CreateTargetMIHazardRecognizer(II, DAG);
+  MHR->AddHazardRecognizer(std::unique_ptr(BHR));
+  return MHR;
+}
+
+// Called during post-RA scheduling when FeatureUseMISched is not set
 ScheduleHazardRecognizer *ARMBaseInstrInfo::
 CreateTargetPostRAHazardRecognizer(const InstrItineraryData *II,
const ScheduleDAG *DAG) const {

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
index 9b6572848ebe..deb008025b1d 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
@@ -131,6 +131,10 @@ class ARMBaseInstrInfo : public ARMGenInstrInfo {
   CreateTargetHazardRecognizer(const TargetSubtargetInfo *STI,
const ScheduleDAG *DAG) const override;
 
+  ScheduleHazardRecognizer *
+  CreateTargetMIHazardRecognizer(const InstrItineraryData *II,
+ const ScheduleDAGMI *DAG) const override;
+
   ScheduleHazardRecognizer *
   CreateTargetPostRAHazardRecognizer(const 

[llvm-branch-commits] [llvm] f47bac5 - [ARM] Extra vecreduce tests with smaller than legal types. NFC

2020-12-20 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-20T21:20:39Z
New Revision: f47bac5dd207c672e803e08685e084e5d66f8bce

URL: 
https://github.com/llvm/llvm-project/commit/f47bac5dd207c672e803e08685e084e5d66f8bce
DIFF: 
https://github.com/llvm/llvm-project/commit/f47bac5dd207c672e803e08685e084e5d66f8bce.diff

LOG: [ARM] Extra vecreduce tests with smaller than legal types. NFC

Added: 


Modified: 
llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-addpred.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-mla.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-mlapred.ll

Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll 
b/llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll
index 995926a1502e..f882582bf148 100644
--- a/llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-vecreduce-add.ll
@@ -236,6 +236,53 @@ entry:
   ret i64 %z
 }
 
+define arm_aapcs_vfpcc i64 @add_v4i16_v4i64_zext(<4 x i16> %x) {
+; CHECK-LABEL: add_v4i16_v4i64_zext:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmovlb.u16 q0, q0
+; CHECK-NEXT:vaddlv.u32 r0, r1, q0
+; CHECK-NEXT:bx lr
+entry:
+  %xx = zext <4 x i16> %x to <4 x i64>
+  %z = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %xx)
+  ret i64 %z
+}
+
+define arm_aapcs_vfpcc i64 @add_v4i16_v4i64_sext(<4 x i16> %x) {
+; CHECK-LABEL: add_v4i16_v4i64_sext:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.f32 s4, s0
+; CHECK-NEXT:vmov.f32 s6, s1
+; CHECK-NEXT:vmov r1, s0
+; CHECK-NEXT:vmov r0, s6
+; CHECK-NEXT:sxth r1, r1
+; CHECK-NEXT:sxth r0, r0
+; CHECK-NEXT:vmov q1[2], q1[0], r1, r0
+; CHECK-NEXT:asrs r2, r0, #31
+; CHECK-NEXT:asrs r1, r1, #31
+; CHECK-NEXT:vmov q1[3], q1[1], r1, r2
+; CHECK-NEXT:vmov r2, s6
+; CHECK-NEXT:vmov r3, s4
+; CHECK-NEXT:vmov r1, s5
+; CHECK-NEXT:vmov.f32 s4, s2
+; CHECK-NEXT:vmov.f32 s6, s3
+; CHECK-NEXT:adds r2, r2, r3
+; CHECK-NEXT:adc.w r0, r1, r0, asr #31
+; CHECK-NEXT:vmov r1, s4
+; CHECK-NEXT:sxth r1, r1
+; CHECK-NEXT:adds r2, r2, r1
+; CHECK-NEXT:adc.w r1, r0, r1, asr #31
+; CHECK-NEXT:vmov r0, s6
+; CHECK-NEXT:sxth r3, r0
+; CHECK-NEXT:adds r0, r2, r3
+; CHECK-NEXT:adc.w r1, r1, r3, asr #31
+; CHECK-NEXT:bx lr
+entry:
+  %xx = sext <4 x i16> %x to <4 x i64>
+  %z = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %xx)
+  ret i64 %z
+}
+
 define arm_aapcs_vfpcc i64 @add_v2i16_v2i64_zext(<2 x i16> %x) {
 ; CHECK-LABEL: add_v2i16_v2i64_zext:
 ; CHECK:   @ %bb.0: @ %entry
@@ -294,6 +341,46 @@ entry:
   ret i32 %z
 }
 
+define arm_aapcs_vfpcc i32 @add_v8i8_v8i32_zext(<8 x i8> %x) {
+; CHECK-LABEL: add_v8i8_v8i32_zext:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmovlb.u8 q0, q0
+; CHECK-NEXT:vaddv.u16 r0, q0
+; CHECK-NEXT:bx lr
+entry:
+  %xx = zext <8 x i8> %x to <8 x i32>
+  %z = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %xx)
+  ret i32 %z
+}
+
+define arm_aapcs_vfpcc i32 @add_v8i8_v8i32_sext(<8 x i8> %x) {
+; CHECK-LABEL: add_v8i8_v8i32_sext:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.u16 r0, q0[6]
+; CHECK-NEXT:vmov.u16 r1, q0[4]
+; CHECK-NEXT:vmov q1[2], q1[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[7]
+; CHECK-NEXT:vmov.u16 r1, q0[5]
+; CHECK-NEXT:vmov q1[3], q1[1], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[2]
+; CHECK-NEXT:vmov.u16 r1, q0[0]
+; CHECK-NEXT:vmovlb.s8 q1, q1
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r0, q0[3]
+; CHECK-NEXT:vmov.u16 r1, q0[1]
+; CHECK-NEXT:vmovlb.s16 q1, q1
+; CHECK-NEXT:vmov q2[3], q2[1], r1, r0
+; CHECK-NEXT:vmovlb.s8 q0, q2
+; CHECK-NEXT:vmovlb.s16 q0, q0
+; CHECK-NEXT:vadd.i32 q0, q0, q1
+; CHECK-NEXT:vaddv.u32 r0, q0
+; CHECK-NEXT:bx lr
+entry:
+  %xx = sext <8 x i8> %x to <8 x i32>
+  %z = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %xx)
+  ret i32 %z
+}
+
 define arm_aapcs_vfpcc i32 @add_v4i8_v4i32_zext(<4 x i8> %x) {
 ; CHECK-LABEL: add_v4i8_v4i32_zext:
 ; CHECK:   @ %bb.0: @ %entry
@@ -599,6 +686,165 @@ entry:
   ret i64 %z
 }
 
+define arm_aapcs_vfpcc i64 @add_v8i8_v8i64_zext(<8 x i8> %x) {
+; CHECK-LABEL: add_v8i8_v8i64_zext:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmovlb.u8 q0, q0
+; CHECK-NEXT:vmov.i64 q1, #0x
+; CHECK-NEXT:vmov.u16 r0, q0[1]
+; CHECK-NEXT:vmov.u16 r1, q0[0]
+; CHECK-NEXT:vmov q2[2], q2[0], r1, r0
+; CHECK-NEXT:vmov.u16 r2, q0[2]
+; CHECK-NEXT:vand q2, q2, q1
+; CHECK-NEXT:vmov r0, s10
+; CHECK-NEXT:vmov r1, s8
+; CHECK-NEXT:add r0, r1
+; CHECK-NEXT:vmov.u16 r1, q0[3]
+; CHECK-NEXT:vmov q3[2], q3[0], r2, r1
+; CHECK-NEXT:vmov.u16 r2, q0[4]
+; CHECK-NEXT:vand q3, q3, q1
+; CHECK-NEXT:vmov r1, s12
+; CHECK-NEXT:add r0, r1
+; CHECK-NEXT:vmov r1, s14
+; CHECK-NEXT:add r0, r1
+; CHECK-NEXT:vmov.u16 

[llvm-branch-commits] [llvm] 1de3e7f - [ARM] Improve handling of empty VPT blocks in tail predicated loops

2020-12-14 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-14T11:17:01Z
New Revision: 1de3e7fd620bc9db2df96a12401bde4bde722785

URL: 
https://github.com/llvm/llvm-project/commit/1de3e7fd620bc9db2df96a12401bde4bde722785
DIFF: 
https://github.com/llvm/llvm-project/commit/1de3e7fd620bc9db2df96a12401bde4bde722785.diff

LOG: [ARM] Improve handling of empty VPT blocks in tail predicated loops

A vpt block that contains just VPST;VCTP or VPT;VCTP will become
invalid once the VCTP is removed. This fixes the first case by removing
the now-empty block, and bails out for the second, as we have no simple
way of converting a VPT to a VCMP.

Differential Revision: https://reviews.llvm.org/D92369
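
A decision sketch of the new handling, assuming a VPT block is
summarised by its opcode names (illustrative strings, not
MachineInstrs):

  #include <string>
  #include <vector>

  // A {VPST, VCTP} block becomes empty once the VCTP is removed, so the
  // VPST is deleted too; a {VPT, VCTP} block cannot be rewritten (no
  // simple VPT -> VCMP conversion), so bail out. Anything else keeps
  // the block and has its mask recomputed.
  enum class Action { RemoveBlock, Bail, RecomputeMask };

  Action classify(const std::vector<std::string> &opcodes) {
    if (opcodes.size() == 2 && opcodes.back() == "VCTP")
      return opcodes.front() == "VPST" ? Action::RemoveBlock : Action::Bail;
    return Action::RecomputeMask;
  }

  int main() {
    return classify({"VPST", "VCTP"}) == Action::RemoveBlock ? 0 : 1;
  }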

Added: 


Modified: 
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/vpt-blocks.mir

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp 
b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
index 52ca722a2e0c..2b53f57a7f09 100644
--- a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
+++ b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
@@ -312,6 +312,14 @@ namespace {
   continue;
 
 SmallVectorImpl<MachineInstr *> &Insts = Block.getInsts();
+// We don't know how to convert a block with just a VPT;VCTP into
+// anything valid once we remove the VCTP. For now just bail out.
+assert(isVPTOpcode(Insts.front()->getOpcode()) &&
+   "Expected VPT block to start with a VPST or VPT!");
+if (Insts.size() == 2 && Insts.front()->getOpcode() != ARM::MVE_VPST &&
+isVCTP(Insts.back()))
+  return false;
+
 for (auto *MI : Insts) {
   // Check that any internal VCTPs are 'Then' predicated.
   if (isVCTP(MI) && getVPTInstrPredicate(*MI) != ARMVCC::Then)
@@ -1547,9 +1555,15 @@ void 
ARMLowOverheadLoops::ConvertVPTBlocks(LowOverheadLoop ) {
   LLVM_DEBUG(dbgs() << "ARM Loops: Removing VPST: " << *VPST);
   LoLoop.ToRemove.insert(VPST);
 } else if (Block.containsVCTP()) {
-  // The vctp will be removed, so the block mask of the vp(s)t will need
-  // to be recomputed.
-  LoLoop.BlockMasksToRecompute.insert(Insts.front());
+  // The vctp will be removed, so either the entire block will be dead or
+  // the block mask of the vp(s)t will need to be recomputed.
+  MachineInstr *VPST = Insts.front();
+  if (Block.size() == 2) {
+assert(VPST->getOpcode() == ARM::MVE_VPST &&
+   "Found a VPST in an otherwise empty vpt block");
+LoLoop.ToRemove.insert(VPST);
+  } else
+LoLoop.BlockMasksToRecompute.insert(VPST);
 } else if (Insts.front()->getOpcode() == ARM::MVE_VPST) {
   // If this block starts with a VPST then attempt to merge it with the
   // preceeding un-merged VCMP into a VPT. This VCMP comes from a VPT

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vpt-blocks.mir 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vpt-blocks.mir
index f7e1d86fd1b0..ab6d05ca6aac 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vpt-blocks.mir
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/vpt-blocks.mir
@@ -149,6 +149,16 @@
   unreachable
   }
 
+  define arm_aapcs_vfpcc void @emptyblock() {
+unreachable
+  }
+  define arm_aapcs_vfpcc void @predvcmp() {
+unreachable
+  }
+  define arm_aapcs_vfpcc void @predvpt() {
+unreachable
+  }
+
   declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 
x i1>, <4 x i32>)
   declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 
immarg, <4 x i1>)
   declare i32 @llvm.start.loop.iterations.i32(i32)
@@ -835,7 +845,7 @@ body: |
   ; CHECK: bb.2.vector.body:
   ; CHECK:   successors: %bb.2(0x7c00), %bb.3(0x0400)
   ; CHECK:   liveins: $lr, $q0, $r2, $r3
-  ; CHECK:   MVE_VPTv4s32r 2, renamable $q0, renamable $r2, 8, implicit-def 
$vpr
+  ; CHECK:   MVE_VPTv4s32r 8, renamable $q0, renamable $r2, 8, implicit-def 
$vpr
   ; CHECK:   dead renamable $vpr = MVE_VCMPs32r renamable $q0, renamable $r3, 
12, 1, killed renamable $vpr
   ; CHECK:   $lr = MVE_LETP killed renamable $lr, %bb.2
   ; CHECK: bb.3.for.cond.cleanup:
@@ -868,7 +878,7 @@ body: |
 successors: %bb.2(0x7c00), %bb.3(0x0400)
 liveins: $lr, $q0, $r0, $r1, $r2, $r3
 
-MVE_VPTv4s32r 2, renamable $q0, renamable $r2, 8, implicit-def $vpr
+MVE_VPTv4s32r 8, renamable $q0, renamable $r2, 8, implicit-def $vpr
 renamable $vpr = MVE_VCMPs32r killed renamable $q0, renamable $r3, 12, 1, 
killed renamable $vpr
 renamable $vpr = MVE_VCTP32 renamable $r1, 0, $noreg
 renamable $r1, dead $cpsr = tSUBi8 killed renamable $r1, 4, 14 /* CC::al 
*/, $noreg
@@ -1001,3 +1011,308 @@ body: |
   bb.3.for.cond.cleanup:
 frame-destroy tPOP_RET 14 /* CC::al */, $noreg, def $r7, def $pc
 ...
+---
+name:emptyblock
+tracksRegLiveness: true
+liveins:
+  - { reg: 

[llvm-branch-commits] [llvm] ab97c9b - [LV] Fix scalar cost for tail predicated loops

2020-12-12 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-12T14:21:40Z
New Revision: ab97c9bdb747c873cd35a18229e2694156a7607d

URL: 
https://github.com/llvm/llvm-project/commit/ab97c9bdb747c873cd35a18229e2694156a7607d
DIFF: 
https://github.com/llvm/llvm-project/commit/ab97c9bdb747c873cd35a18229e2694156a7607d.diff

LOG: [LV] Fix scalar cost for tail predicated loops

When it comes to the scalar cost of any predicated block, the loop
vectorizer by default regards this predication as a sign that it is
looking at an if-conversion and divides the scalar cost of the block by
2, assuming it would only be executed half the time. This however makes
no sense if the predication has been introduced to tail predicate the
loop.

Original patch by Anna Welker

Differential Revision: https://reviews.llvm.org/D86452
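
A sketch of the corrected costing, assuming the usual reciprocal
predicated-block probability of 2 (a predicated block is assumed to
execute half the time); names here are illustrative, not the LV API:

  #include <cstdio>

  // The halving is now applied only when the block was predicated in
  // the original loop (if-conversion), not when the predication comes
  // from tail folding.
  unsigned scalarBlockCost(unsigned rawCost, bool predicatedInOriginalLoop) {
    const unsigned ReciprocalPredBlockProb = 2;
    return predicatedInOriginalLoop ? rawCost / ReciprocalPredBlockProb
                                    : rawCost;
  }

  int main() {
    printf("%u\n", scalarBlockCost(5, true));  // if-converted block: 2
    printf("%u\n", scalarBlockCost(5, false)); // tail-folded body: 5
  }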

Added: 


Modified: 
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll

Removed: 




diff  --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp 
b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index c381377b67c9..663ea50c4c02 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6483,9 +6483,10 @@ LoopVectorizationCostModel::expectedCost(ElementCount 
VF) {
 // if-converted. This means that the block's instructions (aside from
 // stores and instructions that may divide by zero) will now be
 // unconditionally executed. For the scalar case, we may not always execute
-// the predicated block. Thus, scale the block's cost by the probability of
-// executing it.
-if (VF.isScalar() && blockNeedsPredication(BB))
+// the predicated block, if it is an if-else block. Thus, scale the block's
+// cost by the probability of executing it. blockNeedsPredication from
+// Legal is used so as to not include all blocks in tail folded loops.
+if (VF.isScalar() && Legal->blockNeedsPredication(BB))
   BlockCost.first /= getReciprocalPredBlockProb();
 
 Cost.first += BlockCost.first;

diff  --git a/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll 
b/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll
index 959fbe676e67..fc8ea4fc938c 100644
--- a/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll
@@ -15,7 +15,7 @@ define void @pred_loop(i32* %off, i32* %data, i32* %dst, i32 
%n) #0 {
 ; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 store i32 %add1, i32* %arrayidx2, align 4
 ; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 %exitcond.not = icmp eq i32 %add, %n
 ; CHECK-COST-NEXT: LV: Found an estimated cost of 0 for VF 1 For instruction:  
 br i1 %exitcond.not, label %exit.loopexit, label %for.body
-; CHECK-COST-NEXT: LV: Scalar loop costs: 2.
+; CHECK-COST-NEXT: LV: Scalar loop costs: 5.
 
 entry:
   %cmp8 = icmp sgt i32 %n, 0



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] f6e885a - [ARM] Test for showing scalar vector costs. NFC

2020-12-12 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-12T11:43:14Z
New Revision: f6e885ad2a94357f6f4d18ddf26a8111b3df7ed3

URL: 
https://github.com/llvm/llvm-project/commit/f6e885ad2a94357f6f4d18ddf26a8111b3df7ed3
DIFF: 
https://github.com/llvm/llvm-project/commit/f6e885ad2a94357f6f4d18ddf26a8111b3df7ed3.diff

LOG: [ARM] Test for showing scalar vector costs. NFC

Added: 
llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll

Modified: 


Removed: 




diff  --git a/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll 
b/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll
new file mode 100644
index ..959fbe676e67
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/ARM/scalar-block-cost.ll
@@ -0,0 +1,101 @@
+; RUN: opt -loop-vectorize -debug-only=loop-vectorize 
-enable-arm-maskedgatscat -tail-predication=force-enabled -disable-output < %s 
2>&1 | FileCheck %s --check-prefixes=CHECK-COST,CHECK-COST-2
+; REQUIRES: asserts
+
+target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
+target triple = "thumbv8.1m.main-none-none-eabi"
+
+define void @pred_loop(i32* %off, i32* %data, i32* %dst, i32 %n) #0 {
+
+; CHECK-COST: LV: Found an estimated cost of 0 for VF 1 For instruction:   
%i.09 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
+; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 %add = add nuw nsw i32 %i.09, 1
+; CHECK-COST-NEXT: LV: Found an estimated cost of 0 for VF 1 For instruction:  
 %arrayidx = getelementptr inbounds i32, i32* %data, i32 %add
+; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 %0 = load i32, i32* %arrayidx, align 4
+; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 %add1 = add nsw i32 %0, 5
+; CHECK-COST-NEXT: LV: Found an estimated cost of 0 for VF 1 For instruction:  
 %arrayidx2 = getelementptr inbounds i32, i32* %dst, i32 %i.09
+; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 store i32 %add1, i32* %arrayidx2, align 4
+; CHECK-COST-NEXT: LV: Found an estimated cost of 1 for VF 1 For instruction:  
 %exitcond.not = icmp eq i32 %add, %n
+; CHECK-COST-NEXT: LV: Found an estimated cost of 0 for VF 1 For instruction:  
 br i1 %exitcond.not, label %exit.loopexit, label %for.body
+; CHECK-COST-NEXT: LV: Scalar loop costs: 2.
+
+entry:
+  %cmp8 = icmp sgt i32 %n, 0
+  br i1 %cmp8, label %for.body, label %exit
+
+exit: ; preds = %for.body, %entry
+  ret void
+
+for.body: ; preds = %entry, %for.body
+  %i.09 = phi i32 [ %add, %for.body ], [ 0, %entry ]
+  %add = add nuw nsw i32 %i.09, 1
+  %arrayidx = getelementptr inbounds i32, i32* %data, i32 %add
+  %0 = load i32, i32* %arrayidx, align 4
+  %add1 = add nsw i32 %0, 5
+  %arrayidx2 = getelementptr inbounds i32, i32* %dst, i32 %i.09
+  store i32 %add1, i32* %arrayidx2, align 4
+  %exitcond.not = icmp eq i32 %add, %n
+  br i1 %exitcond.not, label %exit, label %for.body
+}
+
+define i32 @if_convert(i32* %a, i32* %b, i32 %start, i32 %end) #0 {
+
+; CHECK-COST-2: LV: Found an estimated cost of 0 for VF 1 For instruction:   
%i.032 = phi i32 [ %inc, %if.end ], [ %start, %for.body.preheader ]
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   %arrayidx = getelementptr inbounds i32, i32* %a, i32 %i.032
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %0 = load i32, i32* %arrayidx, align 4
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   %arrayidx2 = getelementptr inbounds i32, i32* %b, i32 %i.032
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %1 = load i32, i32* %arrayidx2, align 4
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %cmp3 = icmp sgt i32 %0, %1
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   br i1 %cmp3, label %if.then, label %if.end
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %mul = mul nsw i32 %0, 5
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %add = add nsw i32 %mul, 3
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   %factor = shl i32 %add, 1
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %sub = sub i32 %0, %1
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   %add7 = add i32 %sub, %factor
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 1 for VF 1 For 
instruction:   store i32 %add7, i32* %arrayidx2, align 4
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   br label %if.end
+; CHECK-COST-2-NEXT: LV: Found an estimated cost of 0 for VF 1 For 
instruction:   %k.0 = phi i32 [ 

[llvm-branch-commits] [llvm] 3f571be - [ARM] Make t2DoLoopStartTP a terminator

2020-12-11 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-11T09:23:57Z
New Revision: 3f571be1c07b48846f9c1ff31c088b00c3ef1f13

URL: 
https://github.com/llvm/llvm-project/commit/3f571be1c07b48846f9c1ff31c088b00c3ef1f13
DIFF: 
https://github.com/llvm/llvm-project/commit/3f571be1c07b48846f9c1ff31c088b00c3ef1f13.diff

LOG: [ARM] Make t2DoLoopStartTP a terminator

Although this was something that I was hoping we would not have to do,
this patch makes t2DoLoopStartTP a terminator in order to keep it at the
end of its block, disallowing extra MVE instructions between it and the
end. Because t2DoLoopStartTPs also start tail predication regions, it
also marks them as having side effects. The t2DoLoopStart is still not a
terminator, giving it the extra scheduling freedom that can be helpful,
but now that we have a TP version the two can be treated differently.

Differential Revision: https://reviews.llvm.org/D91887

Added: 


Modified: 
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
llvm/lib/Target/ARM/ARMInstrThumb2.td
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/mov-operand.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/reductions.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/while-loops.ll
llvm/test/CodeGen/Thumb2/mve-fma-loops.ll
llvm/test/CodeGen/Thumb2/mve-gather-scatter-tailpred.ll
llvm/test/CodeGen/Thumb2/mve-postinc-dct.ll
llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
llvm/test/CodeGen/Thumb2/mve-pred-vctpvpsel.ll
llvm/test/CodeGen/Thumb2/mve-vecreduce-loops.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
index 45c2b5d32ae4..df237dffe4fb 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
@@ -361,7 +361,8 @@ class ARMBaseInstrInfo : public ARMGenInstrInfo {
   bool shouldOutlineFromFunctionByDefault(MachineFunction ) const override;
 
   bool isUnspillableTerminatorImpl(const MachineInstr *MI) const override {
-return MI->getOpcode() == ARM::t2LoopEndDec;
+return MI->getOpcode() == ARM::t2LoopEndDec ||
+   MI->getOpcode() == ARM::t2DoLoopStartTP;
   }
 
 private:

diff  --git a/llvm/lib/Target/ARM/ARMInstrThumb2.td 
b/llvm/lib/Target/ARM/ARMInstrThumb2.td
index caae58443bec..52da88dab632 100644
--- a/llvm/lib/Target/ARM/ARMInstrThumb2.td
+++ b/llvm/lib/Target/ARM/ARMInstrThumb2.td
@@ -5427,6 +5427,7 @@ def t2DoLoopStart :
   t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts), 4, IIC_Br,
   [(set GPRlr:$X, (int_start_loop_iterations rGPR:$elts))]>;
 
+let isTerminator = 1, hasSideEffects = 1 in
 def t2DoLoopStartTP :
   t2PseudoInst<(outs GPRlr:$X), (ins rGPR:$elts, rGPR:$count), 4, IIC_Br, []>;
 

diff  --git a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp 
b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
index 62f23cf49073..00e4449769f4 100644
--- a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
+++ b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
@@ -341,13 +341,10 @@ bool MVEVPTOptimisations::ConvertTailPredLoop(MachineLoop 
*ML,
   for (MachineInstr  :
MRI->use_instructions(LoopStart->getOperand(0).getReg()))
 if ((InsertPt != MBB->end() && !DT->dominates(&*InsertPt, )) ||
-!DT->dominates(ML->getHeader(), Use.getParent()))
-  InsertPt = 
-  if (InsertPt != MBB->end() &&
-  !DT->dominates(MRI->getVRegDef(CountReg), &*InsertPt)) {
-LLVM_DEBUG(dbgs() << "  InsertPt does not dominate CountReg!\n");
-return false;
-  }
+!DT->dominates(ML->getHeader(), Use.getParent())) {
+  LLVM_DEBUG(dbgs() << "  InsertPt could not be a terminator!\n");
+  return false;
+}
 
   MachineInstrBuilder MI = BuildMI(*MBB, InsertPt, LoopStart->getDebugLoc(),
TII->get(ARM::t2DoLoopStartTP))

diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll
index dcb5afca5d4b..598775e74a8c 100644
--- a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/exitcount.ll
@@ -7,15 +7,15 @@ define void @foo(%struct.SpeexPreprocessState_* nocapture 
readonly %st, i16* %x)
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:.save {r4, lr}
 ; CHECK-NEXT:push {r4, lr}
-; CHECK-NEXT:ldrd r12, r4, [r0]
-; CHECK-NEXT:ldrd r2, r3, [r0, #8]
-; CHECK-NEXT:rsb r12, r12, r4, lsl #1
-; CHECK-NEXT:mov r4, r12
+; CHECK-NEXT:ldrd r12, r2, [r0]
+; CHECK-NEXT:ldrd r4, r3, [r0, #8]
+; CHECK-NEXT:rsb r12, r12, r2, lsl #1
+; CHECK-NEXT:mov r2, r12
 ; CHECK-NEXT:dlstp.16 lr, r12
 ; CHECK-NEXT:  .LBB0_1: @ %do.body
 ; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
 ; CHECK-NEXT:vldrh.u16 q0, [r3], #16
-; CHECK-NEXT:vstrh.16 q0, [r2], #16
+; CHECK-NEXT:vstrh.16 q0, [r4], #16
 ; 

[llvm-branch-commits] [llvm] 0447f35 - [ARM][RegAlloc] Add t2LoopEndDec

2020-12-10 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-10T12:14:23Z
New Revision: 0447f3508f0217e06b4acaaec0937091d071100a

URL: 
https://github.com/llvm/llvm-project/commit/0447f3508f0217e06b4acaaec0937091d071100a
DIFF: 
https://github.com/llvm/llvm-project/commit/0447f3508f0217e06b4acaaec0937091d071100a.diff

LOG: [ARM][RegAlloc] Add t2LoopEndDec

We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd,
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
register allocation to make sure this does not fail.

Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else, like calls, that would
break its ability to use LR. For that, this adds an
isUnspillableTerminator hook to opt in to the new behaviour.

There is a chance that this could cause problems, and so I have added an
escape option in case. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.

This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.

Differential Revision: https://reviews.llvm.org/D91358
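
In scalar terms, the combined instruction and its subs; bne reversion
behave like the following sketch (assuming a trip count of at least
one, as hardware loops do; the body function is a stand-in):

  #include <cstdio>

  void body(unsigned lr) { printf("iteration, lr=%u\n", lr); } // stand-in

  void lowOverheadLoop(unsigned tripCount) {
    unsigned lr = tripCount;  // t2DoLoopStart seeds LR
    do {
      body(lr);
      lr = lr - 1;            // t2LoopEndDec: subs lr, lr, #1 ...
    } while (lr != 0);        // ... then bne back to the loop header
  }

  int main() { lowOverheadLoop(3); }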

Added: 


Modified: 
llvm/include/llvm/CodeGen/TargetInstrInfo.h
llvm/lib/CodeGen/CalcSpillWeights.cpp
llvm/lib/CodeGen/MachineVerifier.cpp
llvm/lib/CodeGen/PHIElimination.cpp
llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
llvm/lib/Target/ARM/ARMInstrThumb2.td
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/count_dominates_start.mir
llvm/test/CodeGen/Thumb2/LowOverheadLoops/fast-fp-loops.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/minloop.ll
llvm/test/CodeGen/Thumb2/LowOverheadLoops/mve-float-loops.ll
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/Thumb2/mve-postinc-dct.ll
llvm/test/CodeGen/Thumb2/mve-postinc-lsr.ll
llvm/test/CodeGen/Thumb2/mve-satmul-loops.ll
llvm/test/CodeGen/Thumb2/mve-vldshuffle.ll

Removed: 




diff  --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h 
b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 68fc129cc0ed..d7a0e47d3bb5 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -348,6 +348,12 @@ class TargetInstrInfo : public MCInstrInfo {
  unsigned , unsigned ,
  const MachineFunction ) const;
 
+  /// Return true if the given instruction is terminator that is unspillable,
+  /// according to isUnspillableTerminatorImpl.
+  bool isUnspillableTerminator(const MachineInstr *MI) const {
+return MI->isTerminator() && isUnspillableTerminatorImpl(MI);
+  }
+
   /// Returns the size in bytes of the specified MachineInstr, or ~0U
   /// when this function is not implemented by a target.
   virtual unsigned getInstSizeInBytes(const MachineInstr ) const {
@@ -954,6 +960,17 @@ class TargetInstrInfo : public MCInstrInfo {
 return None;
   }
 
+  /// Return true if the given terminator MI is not expected to spill. This
+  /// sets the live interval as not spillable and adjusts phi node lowering to
+  /// not introduce copies after the terminator. Use with care, these are
+  /// currently used for hardware loop intrinsics in very controlled situations,
+  /// created prior to register allocation in loops that only have single phi
+  /// users for the terminator's value. They may run out of registers if not used
+  /// carefully.
+  virtual bool isUnspillableTerminatorImpl(const MachineInstr *MI) const {
+return false;
+  }
+
 public:
   /// If the specific machine instruction is a instruction that moves/copies
   /// value from one register to another register return destination and source

diff  --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp 
b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index bf31441c37bb..16f380c1eb62 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -142,6 +142,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval , 
SlotIndex *Start,
SlotIndex *End) {
   MachineRegisterInfo  = 

[llvm-branch-commits] [llvm] 5abbf20 - [ARM] Additional test for Min loop. NFC

2020-12-10 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-10T10:49:00Z
New Revision: 5abbf20f0fe5a1fed0d455bc682ca20d0eb651f7

URL: 
https://github.com/llvm/llvm-project/commit/5abbf20f0fe5a1fed0d455bc682ca20d0eb651f7
DIFF: 
https://github.com/llvm/llvm-project/commit/5abbf20f0fe5a1fed0d455bc682ca20d0eb651f7.diff

LOG: [ARM] Additional test for Min loop. NFC

Added: 
llvm/test/CodeGen/Thumb2/LowOverheadLoops/minloop.ll

Modified: 


Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/LowOverheadLoops/minloop.ll 
b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/minloop.ll
new file mode 100644
index 0..9899417fb4d8e
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/LowOverheadLoops/minloop.ll
@@ -0,0 +1,193 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve 
-verify-machineinstrs %s -o - | FileCheck %s
+
+define void @arm_min_q31(i32* nocapture readonly %pSrc, i32 %blockSize, i32* 
nocapture %pResult, i32* nocapture %pIndex) {
+; CHECK-LABEL: arm_min_q31:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r4, r5, r6, r7, r8, r9, r10, r11, lr}
+; CHECK-NEXT:push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}
+; CHECK-NEXT:.pad #4
+; CHECK-NEXT:sub sp, #4
+; CHECK-NEXT:ldr.w r12, [r0]
+; CHECK-NEXT:subs.w r9, r1, #1
+; CHECK-NEXT:beq .LBB0_3
+; CHECK-NEXT:  @ %bb.1: @ %while.body.preheader
+; CHECK-NEXT:subs r6, r1, #2
+; CHECK-NEXT:and r7, r9, #3
+; CHECK-NEXT:cmp r6, #3
+; CHECK-NEXT:str r7, [sp] @ 4-byte Spill
+; CHECK-NEXT:bhs .LBB0_4
+; CHECK-NEXT:  @ %bb.2:
+; CHECK-NEXT:mov.w r8, #0
+; CHECK-NEXT:b .LBB0_6
+; CHECK-NEXT:  .LBB0_3:
+; CHECK-NEXT:mov.w r8, #0
+; CHECK-NEXT:b .LBB0_10
+; CHECK-NEXT:  .LBB0_4: @ %while.body.preheader.new
+; CHECK-NEXT:bic r6, r9, #3
+; CHECK-NEXT:movs r4, #1
+; CHECK-NEXT:subs r6, #4
+; CHECK-NEXT:mov.w r8, #0
+; CHECK-NEXT:add.w lr, r4, r6, lsr #2
+; CHECK-NEXT:movs r6, #4
+; CHECK-NEXT:mov lr, lr
+; CHECK-NEXT:mov r11, lr
+; CHECK-NEXT:  .LBB0_5: @ %while.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:ldr r10, [r0, #16]!
+; CHECK-NEXT:mov lr, r11
+; CHECK-NEXT:sub.w lr, lr, #1
+; CHECK-NEXT:sub.w r9, r9, #4
+; CHECK-NEXT:ldrd r7, r5, [r0, #-12]
+; CHECK-NEXT:mov r11, lr
+; CHECK-NEXT:ldr r4, [r0, #-4]
+; CHECK-NEXT:cmp r12, r7
+; CHECK-NEXT:it gt
+; CHECK-NEXT:subgt.w r8, r6, #3
+; CHECK-NEXT:csel r7, r7, r12, gt
+; CHECK-NEXT:cmp r7, r5
+; CHECK-NEXT:it gt
+; CHECK-NEXT:subgt.w r8, r6, #2
+; CHECK-NEXT:csel r7, r5, r7, gt
+; CHECK-NEXT:cmp r7, r4
+; CHECK-NEXT:it gt
+; CHECK-NEXT:subgt.w r8, r6, #1
+; CHECK-NEXT:csel r7, r4, r7, gt
+; CHECK-NEXT:cmp r7, r10
+; CHECK-NEXT:csel r8, r6, r8, gt
+; CHECK-NEXT:add.w r6, r6, #4
+; CHECK-NEXT:csel r12, r10, r7, gt
+; CHECK-NEXT:cmp.w lr, #0
+; CHECK-NEXT:bne .LBB0_5
+; CHECK-NEXT:b .LBB0_6
+; CHECK-NEXT:  .LBB0_6: @ %while.end.loopexit.unr-lcssa
+; CHECK-NEXT:ldr r7, [sp] @ 4-byte Reload
+; CHECK-NEXT:cbz r7, .LBB0_10
+; CHECK-NEXT:  @ %bb.7: @ %while.body.epil
+; CHECK-NEXT:ldr r4, [r0, #4]
+; CHECK-NEXT:sub.w r1, r1, r9
+; CHECK-NEXT:cmp r12, r4
+; CHECK-NEXT:csel r8, r1, r8, gt
+; CHECK-NEXT:csel r12, r4, r12, gt
+; CHECK-NEXT:cmp r7, #1
+; CHECK-NEXT:beq .LBB0_10
+; CHECK-NEXT:  @ %bb.8: @ %while.body.epil.1
+; CHECK-NEXT:ldr r4, [r0, #8]
+; CHECK-NEXT:cmp r12, r4
+; CHECK-NEXT:csinc r8, r8, r1, le
+; CHECK-NEXT:csel r12, r4, r12, gt
+; CHECK-NEXT:cmp r7, #2
+; CHECK-NEXT:beq .LBB0_10
+; CHECK-NEXT:  @ %bb.9: @ %while.body.epil.2
+; CHECK-NEXT:ldr r0, [r0, #12]
+; CHECK-NEXT:cmp r12, r0
+; CHECK-NEXT:it gt
+; CHECK-NEXT:addgt.w r8, r1, #2
+; CHECK-NEXT:csel r12, r0, r12, gt
+; CHECK-NEXT:  .LBB0_10: @ %while.end
+; CHECK-NEXT:str.w r12, [r2]
+; CHECK-NEXT:str.w r8, [r3]
+; CHECK-NEXT:add sp, #4
+; CHECK-NEXT:pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}
+entry:
+  %0 = load i32, i32* %pSrc, align 4
+  %blkCnt.015 = add i32 %blockSize, -1
+  %cmp.not17 = icmp eq i32 %blkCnt.015, 0
+  br i1 %cmp.not17, label %while.end, label %while.body.preheader
+
+while.body.preheader: ; preds = %entry
+  %1 = add i32 %blockSize, -2
+  %xtraiter = and i32 %blkCnt.015, 3
+  %2 = icmp ult i32 %1, 3
+  br i1 %2, label %while.end.loopexit.unr-lcssa, label 
%while.body.preheader.new
+
+while.body.preheader.new: ; preds = 
%while.body.preheader
+  %unroll_iter = and i32 %blkCnt.015, -4
+  br label %while.body
+
+while.body:   ; preds = %while.body, 
%while.body.preheader.new
+  %pSrc.addr.021.pn = phi i32* [ %pSrc, %while.body.preheader.new ], [ 
%pSrc.addr.021.3, %while.body ]
+  

[llvm-branch-commits] [llvm] b0ce615 - [ARM] Remove copies from low overhead phi inductions.

2020-12-10 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-10T10:30:31Z
New Revision: b0ce615b2d29524b0b3541d07dd561665b710e79

URL: 
https://github.com/llvm/llvm-project/commit/b0ce615b2d29524b0b3541d07dd561665b710e79
DIFF: 
https://github.com/llvm/llvm-project/commit/b0ce615b2d29524b0b3541d07dd561665b710e79.diff

LOG: [ARM] Remove copies from low overhead phi inductions.

The phi created in a low overhead loop gets created with a default
register class, it seems. There are then copies inserted between the low
overhead loop pseudo instructions (which produce/consume GPRlr
registers) and the phi holding the induction. This patch removes those
as a step towards attempting to make t2LoopDec and t2LoopEnd a single
instruction, and appears useful in its own right, as shown in the tests.

Differential Revision: https://reviews.llvm.org/D91267
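
The copy-chasing idea, reduced to a sketch (register numbers and the
copy map are illustrative stand-ins for MachineRegisterInfo queries):

  #include <cstdio>
  #include <map>

  // Collapse a chain of virtual-register COPYs down to the defining
  // register, the way the pass rewires the phi to use the t2LoopDec and
  // t2DoLoopStart results directly so the copies can be erased.
  unsigned rootOf(unsigned reg, const std::map<unsigned, unsigned> &copyOf) {
    for (auto it = copyOf.find(reg); it != copyOf.end(); it = copyOf.find(reg))
      reg = it->second;
    return reg;
  }

  int main() {
    std::map<unsigned, unsigned> copyOf = {{3, 2}, {2, 1}}; // %3=COPY %2=COPY %1
    printf("%%%u\n", rootOf(3, copyOf)); // prints %1
  }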

Added: 


Modified: 
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/count_dominates_start.mir
llvm/test/CodeGen/Thumb2/mve-fma-loops.ll
llvm/test/CodeGen/Thumb2/mve-gather-scatter-tailpred.ll
llvm/test/CodeGen/Thumb2/mve-pred-vctpvpsel.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp 
b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
index e56c4ce36f7b..20cb98072c9a 100644
--- a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
+++ b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
@@ -59,7 +59,7 @@ class MVEVPTOptimisations : public MachineFunctionPass {
   }
 
 private:
-  bool RevertLoopWithCall(MachineLoop *ML);
+  bool MergeLoopEnd(MachineLoop *ML);
   bool ConvertTailPredLoop(MachineLoop *ML, MachineDominatorTree *DT);
   MachineInstr (MachineBasicBlock ,
 MachineInstr ,
@@ -159,8 +159,15 @@ static bool findLoopComponents(MachineLoop *ML, 
MachineRegisterInfo *MRI,
   return true;
 }
 
-bool MVEVPTOptimisations::RevertLoopWithCall(MachineLoop *ML) {
-  LLVM_DEBUG(dbgs() << "RevertLoopWithCall on loop " << 
ML->getHeader()->getName()
+// This function converts loops with t2LoopDec and t2LoopEnd instructions into
+// a single t2LoopEndDec instruction. To do that it needs to make sure that LR
+// will be valid to be used for the low overhead loop, which means nothing else
+// is using LR (especially calls) and there are no superfluous copies in the
+// loop. The t2LoopEndDec is a branching terminator that produces a value (the
+// decrement) around the loop edge, which means we need to be careful that they
+// will be valid to allocate without any spilling.
+bool MVEVPTOptimisations::MergeLoopEnd(MachineLoop *ML) {
+  LLVM_DEBUG(dbgs() << "MergeLoopEnd on loop " << ML->getHeader()->getName()
 << "\n");
 
   MachineInstr *LoopEnd, *LoopPhi, *LoopStart, *LoopDec;
@@ -181,7 +188,58 @@ bool MVEVPTOptimisations::RevertLoopWithCall(MachineLoop 
*ML) {
 }
   }
 
-  return false;
+  // Remove any copies from the loop, to ensure the phi that remains is both
+  // simpler and contains no extra uses. Because t2LoopEndDec is a terminator
+  // that cannot spill, we need to be careful what remains in the loop.
+  Register PhiReg = LoopPhi->getOperand(0).getReg();
+  Register DecReg = LoopDec->getOperand(0).getReg();
+  Register StartReg = LoopStart->getOperand(0).getReg();
+  // Ensure the uses are expected, and collect any copies we want to remove.
+  SmallVector Copies;
+  auto CheckUsers = [](Register BaseReg,
+  ArrayRef ExpectedUsers,
+  MachineRegisterInfo *MRI) {
+SmallVector Worklist;
+Worklist.push_back(BaseReg);
+while (!Worklist.empty()) {
+  Register Reg = Worklist.pop_back_val();
+  for (MachineInstr  : MRI->use_nodbg_instructions(Reg)) {
+if (count(ExpectedUsers, ))
+  continue;
+if (MI.getOpcode() != TargetOpcode::COPY ||
+!MI.getOperand(0).getReg().isVirtual()) {
+  LLVM_DEBUG(dbgs() << "Extra users of register found: " << MI);
+  return false;
+}
+Worklist.push_back(MI.getOperand(0).getReg());
+Copies.push_back();
+  }
+}
+return true;
+  };
+  if (!CheckUsers(PhiReg, {LoopDec}, MRI) ||
+  !CheckUsers(DecReg, {LoopPhi, LoopEnd}, MRI) ||
+  !CheckUsers(StartReg, {LoopPhi}, MRI))
+return false;
+
+  MRI->constrainRegClass(StartReg, ::GPRlrRegClass);
+  MRI->constrainRegClass(PhiReg, ::GPRlrRegClass);
+  MRI->constrainRegClass(DecReg, ::GPRlrRegClass);
+
+  if (LoopPhi->getOperand(2).getMBB() == ML->getLoopLatch()) {
+LoopPhi->getOperand(3).setReg(StartReg);
+LoopPhi->getOperand(1).setReg(DecReg);
+  } else {
+LoopPhi->getOperand(1).setReg(StartReg);
+LoopPhi->getOperand(3).setReg(DecReg);
+  }
+
+  LoopDec->getOperand(1).setReg(PhiReg);
+  LoopEnd->getOperand(0).setReg(DecReg);
+
+  for (auto *MI : Copies)
+MI->eraseFromParent();
+  return true;
 }
 
 

[llvm-branch-commits] [llvm] eec5b99 - [ARM] MVE vcreate tests, for dual lane moves. NFC

2020-12-10 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-10T09:17:34Z
New Revision: eec5b99901826852f13e11e7f807e175d434f1cd

URL: 
https://github.com/llvm/llvm-project/commit/eec5b99901826852f13e11e7f807e175d434f1cd
DIFF: 
https://github.com/llvm/llvm-project/commit/eec5b99901826852f13e11e7f807e175d434f1cd.diff

LOG: [ARM] MVE vcreate tests, for dual lane moves. NFC

Added: 
llvm/test/CodeGen/Thumb2/mve-vcreate.ll

Modified: 
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll

Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
index 475a392b7c1c..0259cd6770ad 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
@@ -367,7 +367,6 @@ define arm_aapcs_vfpcc i32 @const_mask_threepredabab(<4 x 
i32> %0, <4 x i32> %1,
 }
 
 
-
 declare i32 @llvm.arm.mve.pred.v2i.v4i1(<4 x i1>)
 declare i32 @llvm.arm.mve.pred.v2i.v8i1(<8 x i1>)
 declare i32 @llvm.arm.mve.pred.v2i.v16i1(<16 x i1>)

diff  --git a/llvm/test/CodeGen/Thumb2/mve-vcreate.ll 
b/llvm/test/CodeGen/Thumb2/mve-vcreate.ll
new file mode 100644
index ..e408bc46b47a
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/mve-vcreate.ll
@@ -0,0 +1,482 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve 
-verify-machineinstrs %s -o - | FileCheck %s
+
+define arm_aapcs_vfpcc <4 x i32> @vcreate_i32(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: vcreate_i32:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[0], r1
+; CHECK-NEXT:vmov.32 q0[1], r0
+; CHECK-NEXT:vmov.32 q0[2], r3
+; CHECK-NEXT:vmov.32 q0[3], r2
+; CHECK-NEXT:bx lr
+entry:
+  %conv = zext i32 %a to i64
+  %shl = shl nuw i64 %conv, 32
+  %conv1 = zext i32 %b to i64
+  %or = or i64 %shl, %conv1
+  %0 = insertelement <2 x i64> undef, i64 %or, i64 0
+  %conv2 = zext i32 %c to i64
+  %shl3 = shl nuw i64 %conv2, 32
+  %conv4 = zext i32 %d to i64
+  %or5 = or i64 %shl3, %conv4
+  %1 = insertelement <2 x i64> %0, i64 %or5, i64 1
+  %2 = bitcast <2 x i64> %1 to <4 x i32>
+  ret <4 x i32> %2
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_0123(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_0123:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[0], r0
+; CHECK-NEXT:vmov.32 q0[1], r1
+; CHECK-NEXT:vmov.32 q0[2], r2
+; CHECK-NEXT:vmov.32 q0[3], r3
+; CHECK-NEXT:bx lr
+entry:
+  %v1 = insertelement <4 x i32> undef, i32 %a, i32 0
+  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 1
+  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 2
+  %v4 = insertelement <4 x i32> %v3, i32 %d, i32 3
+  ret <4 x i32> %v4
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_3210(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_3210:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[0], r3
+; CHECK-NEXT:vmov.32 q0[1], r2
+; CHECK-NEXT:vmov.32 q0[2], r1
+; CHECK-NEXT:vmov.32 q0[3], r0
+; CHECK-NEXT:bx lr
+entry:
+  %v1 = insertelement <4 x i32> undef, i32 %a, i32 3
+  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 2
+  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 1
+  %v4 = insertelement <4 x i32> %v3, i32 %d, i32 0
+  ret <4 x i32> %v4
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_0213(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_0213:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[0], r0
+; CHECK-NEXT:vmov.32 q0[1], r2
+; CHECK-NEXT:vmov.32 q0[2], r1
+; CHECK-NEXT:vmov.32 q0[3], r3
+; CHECK-NEXT:bx lr
+entry:
+  %v1 = insertelement <4 x i32> undef, i32 %a, i32 0
+  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 2
+  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 1
+  %v4 = insertelement <4 x i32> %v3, i32 %d, i32 3
+  ret <4 x i32> %v4
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_0220(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_0220:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[0], r3
+; CHECK-NEXT:vmov.32 q0[2], r2
+; CHECK-NEXT:bx lr
+entry:
+  %v1 = insertelement <4 x i32> undef, i32 %a, i32 0
+  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 2
+  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 2
+  %v4 = insertelement <4 x i32> %v3, i32 %d, i32 0
+  ret <4 x i32> %v4
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_321(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_321:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmov.32 q0[1], r2
+; CHECK-NEXT:vmov.32 q0[2], r1
+; CHECK-NEXT:vmov.32 q0[3], r0
+; CHECK-NEXT:bx lr
+entry:
+  %v1 = insertelement <4 x i32> undef, i32 %a, i32 3
+  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 2
+  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 1
+  ret <4 x i32> %v3
+}
+
+define arm_aapcs_vfpcc <4 x i32> @insert_310(i32 %a, i32 %b, i32 %c, i32 %d) {
+; CHECK-LABEL: insert_310:
+; CHECK:   @ 

[llvm-branch-commits] [llvm] 384383e - [ARM] Common inverse constant predicates to VPNOT

2020-12-09 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-09T07:56:44Z
New Revision: 384383e15c177cd0dddae6b0999e527663fb3e22

URL: 
https://github.com/llvm/llvm-project/commit/384383e15c177cd0dddae6b0999e527663fb3e22
DIFF: 
https://github.com/llvm/llvm-project/commit/384383e15c177cd0dddae6b0999e527663fb3e22.diff

LOG: [ARM] Common inverse constant predicates to VPNOT

This scans through blocks looking for constants used as predicates in
MVE instructions. When two constants are found which are the inverse of
one another, the second can be replaced by a VPNOT of the first,
potentially allowing the VPNOT to be folded away into an else predicate
of a vpt block.

Differential Revision: https://reviews.llvm.org/D92470
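
A sketch of the pairing test, mirroring the "~Imm & 0xffff" computation
in the pass (the function name is illustrative):

  #include <cstdio>

  // Two 16-bit predicate masks pair up when one is the bitwise NOT of
  // the other, in which case the later constant can be formed as a
  // VPNOT of the earlier predicate instead of a fresh move.
  bool isInversePredMask(unsigned a, unsigned b) {
    return b == (~a & 0xffffu);
  }

  int main() {
    printf("%d\n", isInversePredMask(0x00ff, 0xff00)); // 1
    printf("%d\n", isInversePredMask(0x00ff, 0x0f0f)); // 0
  }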

Added: 


Modified: 
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
llvm/test/CodeGen/Thumb2/mve-vpt-optimisations.mir

Removed: 




diff  --git a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp 
b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
index ee3821d34025..e56c4ce36f7b 100644
--- a/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
+++ b/llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
@@ -67,6 +67,7 @@ class MVEVPTOptimisations : public MachineFunctionPass {
 Register Target);
   bool ReduceOldVCCRValueUses(MachineBasicBlock );
   bool ReplaceVCMPsByVPNOTs(MachineBasicBlock );
+  bool ReplaceConstByVPNOTs(MachineBasicBlock , MachineDominatorTree *DT);
   bool ConvertVPSEL(MachineBasicBlock );
 };
 
@@ -646,6 +647,90 @@ bool 
MVEVPTOptimisations::ReplaceVCMPsByVPNOTs(MachineBasicBlock ) {
   return !DeadInstructions.empty();
 }
 
+bool MVEVPTOptimisations::ReplaceConstByVPNOTs(MachineBasicBlock ,
+   MachineDominatorTree *DT) {
+  // Scan through the block, looking for instructions that use constant moves
+  // into VPR that are the negative of one another. These are expected to be
+  // COPY's to VCCRRegClass, from a t2MOVi or t2MOVi16. The last seen constant
+  // mask is kept, and VPNOTs of it are added or reused as we scan through
+  // the function.
+  unsigned LastVPTImm = 0;
+  Register LastVPTReg = 0;
+  SmallSet<MachineInstr *, 4> DeadInstructions;
+
+  for (MachineInstr &Instr : MBB.instrs()) {
+// Look for predicated MVE instructions.
+int PIdx = llvm::findFirstVPTPredOperandIdx(Instr);
+if (PIdx == -1)
+  continue;
+Register VPR = Instr.getOperand(PIdx + 1).getReg();
+if (!VPR.isVirtual())
+  continue;
+
+// From that we are looking for an instruction like %11:vccr = COPY 
%9:rgpr.
+MachineInstr *Copy = MRI->getVRegDef(VPR);
+if (!Copy || Copy->getOpcode() != TargetOpcode::COPY ||
+!Copy->getOperand(1).getReg().isVirtual() ||
+MRI->getRegClass(Copy->getOperand(1).getReg()) == &ARM::VCCRRegClass) {
+  LastVPTReg = 0;
+  continue;
+}
+Register GPR = Copy->getOperand(1).getReg();
+
+// Find the Immediate used by the copy.
+auto getImm = [&](Register GPR) -> unsigned {
+  MachineInstr *Def = MRI->getVRegDef(GPR);
+  if (Def && (Def->getOpcode() == ARM::t2MOVi ||
+  Def->getOpcode() == ARM::t2MOVi16))
+return Def->getOperand(1).getImm();
+  return -1U;
+};
+unsigned Imm = getImm(GPR);
+if (Imm == -1U) {
+  LastVPTReg = 0;
+  continue;
+}
+
+unsigned NotImm = ~Imm & 0xffff;
+if (LastVPTReg != 0 && LastVPTReg != VPR && LastVPTImm == Imm) {
+  Instr.getOperand(PIdx + 1).setReg(LastVPTReg);
+  if (MRI->use_empty(VPR)) {
+DeadInstructions.insert(Copy);
+if (MRI->hasOneUse(GPR))
+  DeadInstructions.insert(MRI->getVRegDef(GPR));
+  }
+  LLVM_DEBUG(dbgs() << "Reusing predicate: in  " << Instr);
+} else if (LastVPTReg != 0 && LastVPTImm == NotImm) {
+  // We have found the not of a previous constant. Create a VPNot of the
+  // earlier predicate reg and use it instead of the copy.
+  Register NewVPR = MRI->createVirtualRegister(&ARM::VCCRRegClass);
+  auto VPNot = BuildMI(MBB, &Instr, Instr.getDebugLoc(),
+   TII->get(ARM::MVE_VPNOT), NewVPR)
+   .addReg(LastVPTReg);
+  addUnpredicatedMveVpredNOp(VPNot);
+
+  // Use the new register and check if the def is now dead.
+  Instr.getOperand(PIdx + 1).setReg(NewVPR);
+  if (MRI->use_empty(VPR)) {
+DeadInstructions.insert(Copy);
+if (MRI->hasOneUse(GPR))
+  DeadInstructions.insert(MRI->getVRegDef(GPR));
+  }
+  LLVM_DEBUG(dbgs() << "Adding VPNot: " << *VPNot << "  to replace use at "
+<< Instr);
+  VPR = NewVPR;
+}
+
+LastVPTImm = Imm;
+LastVPTReg = VPR;
+  }
+
+  for (MachineInstr *DI : DeadInstructions)
+DI->eraseFromParent();
+
+  return !DeadInstructions.empty();
+}
+
 // Replace VPSEL with a predicated VMOV in blocks with a 

[llvm-branch-commits] [llvm] 8254d70 - [ARM] Constant Mask VPT block tests. NFC

2020-12-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-09T07:44:49Z
New Revision: 8254d70a38838af9a4bba7d1062f758fa2fc7214

URL: 
https://github.com/llvm/llvm-project/commit/8254d70a38838af9a4bba7d1062f758fa2fc7214
DIFF: 
https://github.com/llvm/llvm-project/commit/8254d70a38838af9a4bba7d1062f758fa2fc7214.diff

LOG: [ARM] Constant Mask VPT block tests. NFC

Added: 


Modified: 
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll

Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
index 2d5b6ba0cafa..0e80a1c160e0 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
@@ -147,6 +147,271 @@ entry:
 
 
 
+define arm_aapcs_vfpcc i32 @const_mask_1(<4 x i32> %0, <4 x i32> %1, i32 %2) {
+; CHECK-LABEL: const_mask_1:
+; CHECK:   @ %bb.0:
+; CHECK-NEXT:movs r1, #1
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:movw r1, #65534
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:bx lr
+  %4 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 1)
+  %5 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %4)
+  %6 = add i32 %5, %2
+  %7 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %4)
+  %8 = add i32 %6, %7
+  %9 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 65534)
+  %10 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %9)
+  %11 = add i32 %8, %10
+  %12 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %9)
+  %13 = add i32 %11, %12
+  ret i32 %13
+}
+
+define arm_aapcs_vfpcc i32 @const_mask_not1(<4 x i32> %0, <4 x i32> %1, i32 
%2) {
+; CHECK-LABEL: const_mask_not1:
+; CHECK:   @ %bb.0:
+; CHECK-NEXT:movs r1, #1
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:movw r1, #65533
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:bx lr
+  %4 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 1)
+  %5 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %4)
+  %6 = add i32 %5, %2
+  %7 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %4)
+  %8 = add i32 %6, %7
+  %9 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 65533)
+  %10 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %9)
+  %11 = add i32 %8, %10
+  %12 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %9)
+  %13 = add i32 %11, %12
+  ret i32 %13
+}
+
+define arm_aapcs_vfpcc i32 @const_mask_1234(<4 x i32> %0, <4 x i32> %1, i32 
%2) {
+; CHECK-LABEL: const_mask_1234:
+; CHECK:   @ %bb.0:
+; CHECK-NEXT:movw r1, #1234
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:movw r1, #64301
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:bx lr
+  %4 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 1234)
+  %5 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %4)
+  %6 = add i32 %5, %2
+  %7 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %4)
+  %8 = add i32 %6, %7
+  %9 = tail call <4 x i1> @llvm.arm.mve.pred.i2v.v4i1(i32 64301)
+  %10 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %0, 
i32 0, <4 x i1> %9)
+  %11 = add i32 %8, %10
+  %12 = tail call i32 @llvm.arm.mve.addv.predicated.v4i32.v4i1(<4 x i32> %1, 
i32 0, <4 x i1> %9)
+  %13 = add i32 %11, %12
+  ret i32 %13
+}
+
+define arm_aapcs_vfpcc i32 @const_mask_abab(<4 x i32> %0, <4 x i32> %1, i32 
%2) {
+; CHECK-LABEL: const_mask_abab:
+; CHECK:   @ %bb.0:
+; CHECK-NEXT:.pad #8
+; CHECK-NEXT:sub sp, #8
+; CHECK-NEXT:movw r1, #1234
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:movw r1, #64301
+; CHECK-NEXT:vstr p0, [sp, #4] @ 4-byte Spill
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vstr p0, [sp] @ 4-byte Spill
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:vldr p0, [sp, #4] @ 4-byte Reload
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvat.s32 r0, q1
+; CHECK-NEXT:vldr p0, [sp] @ 4-byte Reload
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvat.s32 r0, q0
+; CHECK-NEXT:add sp, #8
+; CHECK-NEXT:bx lr
+  

[llvm-branch-commits] [llvm] 03e675f - [ARM] Turn pred_cast(xor(x, -1)) into xor(pred_cast(x), -1)

2020-12-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-08T15:22:46Z
New Revision: 03e675fd128bed754454fc176357ad0ec6660c47

URL: 
https://github.com/llvm/llvm-project/commit/03e675fd128bed754454fc176357ad0ec6660c47
DIFF: 
https://github.com/llvm/llvm-project/commit/03e675fd128bed754454fc176357ad0ec6660c47.diff

LOG: [ARM] Turn pred_cast(xor(x, -1)) into xor(pred_cast(x), -1)

This folds a not (an xor -1) though a predicate_cast, so that it can be
turned into a VPNOT and potentially be folded away as an else predicate
inside a VPT block.

Differential Revision: https://reviews.llvm.org/D92235
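
Sketched as a rewrite (operand names are illustrative; the payoff is visible
in the updated CHECK lines below, where the scalar mvns becomes a vpnot on the
predicate):

  pred_cast (xor x, -1)  -->  xor (pred_cast x), (pred_cast 65535)

  mvns r0, r0                 vmsr p0, r0
  vmsr p0, r0        -->      vpnot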

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index bc9222151899..1c6acbcf1a88 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -13848,6 +13848,16 @@ PerformPREDICATE_CASTCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
 return DCI.DAG.getNode(ARMISD::PREDICATE_CAST, dl, VT, Op->getOperand(0));
   }
 
+  // Turn pred_cast(xor x, -1) into xor(pred_cast x, -1), in order to produce
+  // more VPNOT which might get folded as else predicates.
+  if (Op.getValueType() == MVT::i32 && isBitwiseNot(Op)) {
+SDValue X =
+DCI.DAG.getNode(ARMISD::PREDICATE_CAST, dl, VT, Op->getOperand(0));
+SDValue C = DCI.DAG.getNode(ARMISD::PREDICATE_CAST, dl, VT,
+DCI.DAG.getConstant(65535, dl, MVT::i32));
+return DCI.DAG.getNode(ISD::XOR, dl, VT, X, C);
+  }
+
   // Only the bottom 16 bits of the source register are used.
   if (Op.getValueType() == MVT::i32) {
 APInt DemandedMask = APInt::getLowBitsSet(32, 16);

diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
index 17f57743c301..2d5b6ba0cafa 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
@@ -52,14 +52,11 @@ define arm_aapcs_vfpcc void @const(<8 x i16> %acc0, <8 x 
i16> %acc1, i32* nocapt
 ; CHECK-NEXT:.save {r4, r6, r7, lr}
 ; CHECK-NEXT:push {r4, r6, r7, lr}
 ; CHECK-NEXT:vmsr p0, r1
-; CHECK-NEXT:mvns r1, r1
-; CHECK-NEXT:vpstt
+; CHECK-NEXT:vpsttee
 ; CHECK-NEXT:vaddvt.s16 r12, q1
 ; CHECK-NEXT:vaddvt.s16 r2, q0
-; CHECK-NEXT:vmsr p0, r1
-; CHECK-NEXT:vpstt
-; CHECK-NEXT:vaddvt.s16 r4, q1
-; CHECK-NEXT:vaddvt.s16 r6, q0
+; CHECK-NEXT:vaddve.s16 r4, q1
+; CHECK-NEXT:vaddve.s16 r6, q0
 ; CHECK-NEXT:stm.w r0, {r2, r6, r12}
 ; CHECK-NEXT:str r4, [r0, #12]
 ; CHECK-NEXT:pop {r4, r6, r7, pc}
@@ -88,9 +85,9 @@ entry:
 define arm_aapcs_vfpcc <4 x i32> @xorvpnot_i32(<4 x i32> %acc0, i16 signext 
%p0) {
 ; CHECK-LABEL: xorvpnot_i32:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:mvns r0, r0
-; CHECK-NEXT:vmov.i32 q1, #0x0
 ; CHECK-NEXT:vmsr p0, r0
+; CHECK-NEXT:vmov.i32 q1, #0x0
+; CHECK-NEXT:vpnot
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
 entry:
@@ -104,9 +101,9 @@ entry:
 define arm_aapcs_vfpcc <8 x i16> @xorvpnot_i16(<8 x i16> %acc0, i16 signext 
%p0) {
 ; CHECK-LABEL: xorvpnot_i16:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:mvns r0, r0
-; CHECK-NEXT:vmov.i32 q1, #0x0
 ; CHECK-NEXT:vmsr p0, r0
+; CHECK-NEXT:vmov.i32 q1, #0x0
+; CHECK-NEXT:vpnot
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
 entry:
@@ -120,9 +117,9 @@ entry:
 define arm_aapcs_vfpcc <16 x i8> @xorvpnot_i8(<16 x i8> %acc0, i16 signext 
%p0) {
 ; CHECK-LABEL: xorvpnot_i8:
 ; CHECK:   @ %bb.0: @ %entry
-; CHECK-NEXT:mvns r0, r0
-; CHECK-NEXT:vmov.i32 q1, #0x0
 ; CHECK-NEXT:vmsr p0, r0
+; CHECK-NEXT:vmov.i32 q1, #0x0
+; CHECK-NEXT:vpnot
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
 entry:
@@ -133,6 +130,21 @@ entry:
   ret <16 x i8> %l6
 }
 
+define arm_aapcs_vfpcc <16 x i8> @xorvpnot_i8_2(<16 x i8> %acc0, i32 %p0) {
+; CHECK-LABEL: xorvpnot_i8_2:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:vmsr p0, r0
+; CHECK-NEXT:vmov.i32 q1, #0x0
+; CHECK-NEXT:vpnot
+; CHECK-NEXT:vpsel q0, q0, q1
+; CHECK-NEXT:bx lr
+entry:
+  %l3 = xor i32 %p0, 65535
+  %l5 = tail call <16 x i1> @llvm.arm.mve.pred.i2v.v16i1(i32 %l3)
+  %l6 = select <16 x i1> %l5, <16 x i8> %acc0, <16 x i8> zeroinitializer
+  ret <16 x i8> %l6
+}
+
 
 
 declare i32 @llvm.arm.mve.pred.v2i.v4i1(<4 x i1>)



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] c100d7b - [NFC] Chec[^k] -> Check

2020-12-08 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-08T11:54:39Z
New Revision: c100d7ba36a5501bba6f7107a531323a51498bf6

URL: 
https://github.com/llvm/llvm-project/commit/c100d7ba36a5501bba6f7107a531323a51498bf6
DIFF: 
https://github.com/llvm/llvm-project/commit/c100d7ba36a5501bba6f7107a531323a51498bf6.diff

LOG: [NFC] Chec[^k] -> Check

Some test updates, all appearing to use the wrong spelling of CHECK.
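
The practical hazard (illustrative lines, not from one specific test):
FileCheck only acts on directives whose prefix matches, so a misspelled
directive is silently ignored and the assertion it was meant to make never
runs:

  ; CHECL: movs r0, #8   <- unknown prefix, silently skipped
  ; CHECK: movs r0, #8   <- actually verified against the output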

Added: 


Modified: 
llvm/test/CodeGen/AArch64/arm64-fold-lsl.ll
llvm/test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll
llvm/test/CodeGen/ARM/fold-stack-adjust.ll
llvm/test/CodeGen/ARM/v7k-abi-align.ll
llvm/test/CodeGen/ARM/vminmaxnm-safe.ll
llvm/test/DebugInfo/COFF/retained-types.ll
llvm/test/MC/ARM/thumb_set-diagnostics.s
llvm/test/MC/Mips/macro-ddiv.s
llvm/test/Transforms/IRCE/pre_post_loops.ll
llvm/test/tools/llvm-readobj/ELF/groups.test
llvm/utils/llvm-compilers-check

Removed: 




diff  --git a/llvm/test/CodeGen/AArch64/arm64-fold-lsl.ll 
b/llvm/test/CodeGen/AArch64/arm64-fold-lsl.ll
index 0790e4c58c46..5e4aae3df836 100644
--- a/llvm/test/CodeGen/AArch64/arm64-fold-lsl.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-fold-lsl.ll
@@ -298,7 +298,7 @@ define i32 @load_doubleword_trunc_word_reuse_shift(i64* 
%ptr, i64 %off) {
 ; CHECK-LABEL: load_doubleword_trunc_word_reuse_shift:
 ; CHECK: lsl x[[REG1:[0-9]+]], x1, #3
 ; CHECK: ldr w[[REG2:[0-9]+]], [x0, x[[REG1]]]
-; CHECL: add w0, w[[REG2]], w[[REG1]]
+; CHECK: add w0, w[[REG2]], w[[REG1]]
 entry:
   %idx = getelementptr inbounds i64, i64* %ptr, i64 %off
   %x = load i64, i64* %idx, align 8

diff  --git a/llvm/test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll 
b/llvm/test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll
index a75dd591dfce..542202ced703 100644
--- a/llvm/test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll
+++ b/llvm/test/CodeGen/ARM/ParallelDSP/inner-full-unroll.ll
@@ -72,7 +72,7 @@ for.body: ; preds = 
%entry, %for.body
 }
 
 ; CHECK-LABEL: full_unroll_sub
-; CHEC: [[IV:%[^ ]+]] = phi i32
+; CHECK: [[IV:%[^ ]+]] = phi i32
 ; CHECK: [[AI:%[^ ]+]] = getelementptr inbounds i32, i32* %a, i32 [[IV]]
 ; CHECK: [[BI:%[^ ]+]] = getelementptr inbounds i16*, i16** %b, i32 [[IV]]
 ; CHECK: [[BIJ:%[^ ]+]] = load i16*, i16** [[BI]], align 4

diff  --git a/llvm/test/CodeGen/ARM/fold-stack-adjust.ll 
b/llvm/test/CodeGen/ARM/fold-stack-adjust.ll
index 6256138e9a02..e22aa882404b 100644
--- a/llvm/test/CodeGen/ARM/fold-stack-adjust.ll
+++ b/llvm/test/CodeGen/ARM/fold-stack-adjust.ll
@@ -1,5 +1,5 @@
 ; Disable shrink-wrapping on the first test otherwise we wouldn't
-; exerce the path for PR18136.
+; exercise the path for PR18136.
 ; RUN: llc -mtriple=thumbv7-apple-none-macho < %s -enable-shrink-wrap=false 
-verify-machineinstrs | FileCheck %s --check-prefixes=CHECK-FNSTART,CHECK
 ; RUN: llc -mtriple=thumbv6m-apple-none-macho -frame-pointer=all < %s 
-verify-machineinstrs | FileCheck %s --check-prefixes=CHECK-FNSTART,CHECK-T1
 ; RUN: llc -mtriple=thumbv6m-apple-none-macho < %s -verify-machineinstrs | 
FileCheck %s --check-prefixes=CHECK-FNSTART,CHECK-T1-NOFP
@@ -30,12 +30,12 @@ define void @check_simple() minsize {
   ; iOS always has a frame pointer and messing with the push affects
   ; how it's set in the prologue. Make sure we get that right.
 ; CHECK-IOS: push {r3, r4, r5, r6, r7, lr}
-; CHECK-NOT: sub sp,
+; CHECK-IOS-NOT: sub sp,
 ; CHECK-IOS: add r7, sp, #16
-; CHECK-NOT: sub sp,
+; CHECK-IOS-NOT: sub sp,
 ; ...
-; CHECK-NOT: add sp,
-; CHEC: pop {r3, r4, r5, r6, r7, pc}
+; CHECK-IOS-NOT: add sp,
+; CHECK-IOS: pop {r0, r1, r2, r3, r7, pc}
 
   %var = alloca i8, i32 16
   call void @bar(i8* %var)

diff  --git a/llvm/test/CodeGen/ARM/v7k-abi-align.ll 
b/llvm/test/CodeGen/ARM/v7k-abi-align.ll
index d7a95c0faa17..be4d876a59ec 100644
--- a/llvm/test/CodeGen/ARM/v7k-abi-align.ll
+++ b/llvm/test/CodeGen/ARM/v7k-abi-align.ll
@@ -4,25 +4,25 @@
 
 define i32 @test_i64_align() "frame-pointer"="all" {
 ; CHECK-LABEL: test_i64_align:
-; CHECL: movs r0, #8
+; CHECK: movs r0, #8
   ret i32 ptrtoint(i64* getelementptr(%struct, %struct* null, i32 0, i32 1) to 
i32)
 }
 
 define i32 @test_f64_align() "frame-pointer"="all" {
 ; CHECK-LABEL: test_f64_align:
-; CHECL: movs r0, #24
+; CHECK: movs r0, #24
   ret i32 ptrtoint(double* getelementptr(%struct, %struct* null, i32 0, i32 3) 
to i32)
 }
 
 define i32 @test_v2f32_align() "frame-pointer"="all" {
 ; CHECK-LABEL: test_v2f32_align:
-; CHECL: movs r0, #40
+; CHECK: movs r0, #40
   ret i32 ptrtoint(<2 x float>* getelementptr(%struct, %struct* null, i32 0, 
i32 5) to i32)
 }
 
 define i32 @test_v4f32_align() "frame-pointer"="all" {
 ; CHECK-LABEL: test_v4f32_align:
-; CHECL: movs r0, #64
+; CHECK: movs r0, #64
   ret i32 ptrtoint(<4 x float>* getelementptr(%struct, %struct* null, i32 0, 
i32 7) to i32)
 }
 

diff  --git a/llvm/test/CodeGen/ARM/vminmaxnm-safe.ll 

[llvm-branch-commits] [llvm] d9bf624 - [ARM] Revert low overhead loops with calls before register allocation.

2020-12-07 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-07T15:44:40Z
New Revision: d9bf6245bfef41ad7606f0e64e0c4f12d65a2b46

URL: 
https://github.com/llvm/llvm-project/commit/d9bf6245bfef41ad7606f0e64e0c4f12d65a2b46
DIFF: 
https://github.com/llvm/llvm-project/commit/d9bf6245bfef41ad7606f0e64e0c4f12d65a2b46.diff

LOG: [ARM] Revert low overhead loops with calls before register allocation.

This adds code to revert low overhead loops with calls in them before
register allocation. Ideally we would not create low overhead loops with
calls in them to begin with, but that can be difficult to always get
correct. If we want to try and glue together t2LoopDec and t2LoopEnd
into a single instruction, we need to ensure that no instructions use LR
in the loop. (Technically the final code can be better too, as it
doesn't need to use the same registers but that has not been optimized
for here, as reverting loops with calls is expected to be very rare).

It also adds a MVETailPredUtils.h header to share the revert code
between different passes, and provides a place to expand upon, with
RevertLoopWithCall becoming a place to perform other low overhead loop
alterations like removing copies or combining LoopDec and End into a
single instruction.

Differential Revision: https://reviews.llvm.org/D91273
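
A rough sketch of the reversion at the MIR level (virtual registers, before
register allocation; opcodes are indicative only and not copied from the new
test). With a call in the body, LR cannot carry the count, so the loop pseudos
are rewritten to an ordinary subs/bne structure:

  %start = t2DoLoopStart %n           %count = COPY %n
  bb.loop:                            bb.loop:
    tBL @callee                         tBL @callee
    %dec = t2LoopDec %start, 1          %dec = t2SUBri %count, 1 (sets CPSR)
    t2LoopEnd %dec, %bb.loop            t2Bcc %bb.loop, ne, $cpsr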

Added: 
llvm/lib/Target/ARM/MVETailPredUtils.h
llvm/test/CodeGen/Thumb2/LowOverheadLoops/revertcallearly.mir

Modified: 
llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
llvm/lib/Target/ARM/ARMBaseInstrInfo.h
llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
llvm/lib/Target/ARM/MVEVPTOptimisationsPass.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-default.mir
llvm/test/CodeGen/Thumb2/LowOverheadLoops/biquad-cascade-optsize-strd-lr.mir
llvm/test/CodeGen/Thumb2/LowOverheadLoops/loop-dec-copy-chain.mir
llvm/test/CodeGen/Thumb2/LowOverheadLoops/revert-non-loop.mir
llvm/test/CodeGen/Thumb2/LowOverheadLoops/unsafe-cpsr-loop-use.mir

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
index 6426d7d85dcd..f095397ec3f9 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
@@ -19,6 +19,7 @@
 #include "ARMSubtarget.h"
 #include "MCTargetDesc/ARMAddressingModes.h"
 #include "MCTargetDesc/ARMBaseInfo.h"
+#include "MVETailPredUtils.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallSet.h"

diff  --git a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h 
b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
index 461a83693c79..234e8db88d26 100644
--- a/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
+++ b/llvm/lib/Target/ARM/ARMBaseInstrInfo.h
@@ -614,56 +614,6 @@ unsigned VCMPOpcodeToVPT(unsigned Opcode) {
   }
 }
 
-static inline
-unsigned VCTPOpcodeToLSTP(unsigned Opcode, bool IsDoLoop) {
-  switch (Opcode) {
-  default:
-llvm_unreachable("unhandled vctp opcode");
-break;
-  case ARM::MVE_VCTP8:
-return IsDoLoop ? ARM::MVE_DLSTP_8 : ARM::MVE_WLSTP_8;
-  case ARM::MVE_VCTP16:
-return IsDoLoop ? ARM::MVE_DLSTP_16 : ARM::MVE_WLSTP_16;
-  case ARM::MVE_VCTP32:
-return IsDoLoop ? ARM::MVE_DLSTP_32 : ARM::MVE_WLSTP_32;
-  case ARM::MVE_VCTP64:
-return IsDoLoop ? ARM::MVE_DLSTP_64 : ARM::MVE_WLSTP_64;
-  }
-  return 0;
-}
-
-static inline unsigned getTailPredVectorWidth(unsigned Opcode) {
-  switch (Opcode) {
-  default:
-llvm_unreachable("unhandled vctp opcode");
-  case ARM::MVE_VCTP8:  return 16;
-  case ARM::MVE_VCTP16: return 8;
-  case ARM::MVE_VCTP32: return 4;
-  case ARM::MVE_VCTP64: return 2;
-  }
-  return 0;
-}
-
-static inline bool isVCTP(const MachineInstr *MI) {
-  switch (MI->getOpcode()) {
-  default:
-break;
-  case ARM::MVE_VCTP8:
-  case ARM::MVE_VCTP16:
-  case ARM::MVE_VCTP32:
-  case ARM::MVE_VCTP64:
-return true;
-  }
-  return false;
-}
-
-static inline
-bool isLoopStart(MachineInstr &MI) {
-  return MI.getOpcode() == ARM::t2DoLoopStart ||
- MI.getOpcode() == ARM::t2DoLoopStartTP ||
- MI.getOpcode() == ARM::t2WhileLoopStart;
-}
-
 static inline
 bool isCondBranchOpcode(int Opc) {
   return Opc == ARM::Bcc || Opc == ARM::tBcc || Opc == ARM::t2Bcc;

diff  --git a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp 
b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
index 0f0418901bec..6901272496a0 100644
--- a/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
+++ b/llvm/lib/Target/ARM/ARMLowOverheadLoops.cpp
@@ -56,6 +56,7 @@
 #include "ARMBaseRegisterInfo.h"
 #include "ARMBasicBlockInfo.h"
 #include "ARMSubtarget.h"
+#include "MVETailPredUtils.h"
 #include "Thumb2InstrInfo.h"
 #include "llvm/ADT/SetOperations.h"
 #include "llvm/ADT/SmallSet.h"
@@ -1310,33 +1311,16 @@ bool ARMLowOverheadLoops::ProcessLoop(MachineLoop *ML) {
 // another low register.
 void ARMLowOverheadLoops::RevertWhile(MachineInstr *MI) const {
   LLVM_DEBUG(dbgs() << "ARM Loops: Reverting to cmp: " << 

[llvm-branch-commits] [llvm] 99eb0f1 - [Intrinsics] Re-remove experimental_vector_reduce intrinsics

2020-12-02 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-02T09:22:41Z
New Revision: 99eb0f16c35cdaa04dea4c5bbad4f86408e9dcfd

URL: 
https://github.com/llvm/llvm-project/commit/99eb0f16c35cdaa04dea4c5bbad4f86408e9dcfd
DIFF: 
https://github.com/llvm/llvm-project/commit/99eb0f16c35cdaa04dea4c5bbad4f86408e9dcfd.diff

LOG: [Intrinsics] Re-remove experimental_vector_reduce intrinsics

These were re-added by fbfb1c790982277eaa5134c2b6aa001e97fe828d but
should not have been. This removes the old experimental versions of the
reduction intrinsics again, leaving the new non-experimental ones.

Differential Revision: https://reviews.llvm.org/D92411
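
Only the spelling of the intrinsics changes; for example the integer add
reduction keeps its non-experimental form:

  %r = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %v) ; old, removed
  %r = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %v)              ; new, kept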

Added: 


Modified: 
llvm/include/llvm/IR/Intrinsics.td

Removed: 




diff  --git a/llvm/include/llvm/IR/Intrinsics.td 
b/llvm/include/llvm/IR/Intrinsics.td
index 6f7317827ef8..9e64a61cf481 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1518,34 +1518,6 @@ let IntrProperties = [IntrNoMem] in {
  [llvm_anyvector_ty]>;
   def int_vector_reduce_fmin : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_v2_fadd : 
DefaultAttrsIntrinsic<[llvm_anyfloat_ty],
- [LLVMMatchType<0>,
-  llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_v2_fmul : 
DefaultAttrsIntrinsic<[llvm_anyfloat_ty],
- [LLVMMatchType<0>,
-  llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_add : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
- [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_mul : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
- [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_and : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
- [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_or : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-[llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_xor : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
- [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_smax : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_smin : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_umax : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_umin : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_fmax : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
-  def int_experimental_vector_reduce_fmin : 
DefaultAttrsIntrinsic<[LLVMVectorElementType<0>],
-  [llvm_anyvector_ty]>;
 }
 
 //===- Matrix intrinsics -===//



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] eedf0ed - [ARM] Mark select and selectcc of MVE vector operations as expand.

2020-12-01 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-01T15:05:55Z
New Revision: eedf0ed63e82ba2f8d2cbc12d6dae61035ed4f9a

URL: 
https://github.com/llvm/llvm-project/commit/eedf0ed63e82ba2f8d2cbc12d6dae61035ed4f9a
DIFF: 
https://github.com/llvm/llvm-project/commit/eedf0ed63e82ba2f8d2cbc12d6dae61035ed4f9a.diff

LOG: [ARM] Mark select and selectcc of MVE vector operations as expand.

We already expand select and select_cc in codegenprepare, but they can
still be generated under some situations. Explicitly mark them as expand
to ensure they are not produced, leading to a failure to select the
nodes.

Differential Revision: https://reviews.llvm.org/D92373
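
The problematic shape is a scalar condition selecting between whole MVE
vectors, as in this minimal sketch (the new test @e below reaches the same
pattern from a loop); with Expand set it is lowered to bitwise masking
(vbic/vand/vorr in the CHECK lines) instead of failing to select:

  %s = select i1 %c, <4 x i32> %a, <4 x i32> %b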

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/Thumb2/mve-selectcc.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 0426a560805a..bc9222151899 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -289,6 +289,8 @@ void ARMTargetLowering::addMVEVectorTypes(bool HasMVEFP) {
 setOperationAction(ISD::UDIVREM, VT, Expand);
 setOperationAction(ISD::SDIVREM, VT, Expand);
 setOperationAction(ISD::CTPOP, VT, Expand);
+setOperationAction(ISD::SELECT, VT, Expand);
+setOperationAction(ISD::SELECT_CC, VT, Expand);
 
 // Vector reductions
 setOperationAction(ISD::VECREDUCE_ADD, VT, Legal);
@@ -335,6 +337,8 @@ void ARMTargetLowering::addMVEVectorTypes(bool HasMVEFP) {
 setOperationAction(ISD::SETCC, VT, Custom);
 setOperationAction(ISD::MLOAD, VT, Custom);
 setOperationAction(ISD::MSTORE, VT, Legal);
+setOperationAction(ISD::SELECT, VT, Expand);
+setOperationAction(ISD::SELECT_CC, VT, Expand);
 
 // Pre and Post inc are supported on loads and stores
 for (unsigned im = (unsigned)ISD::PRE_INC;

diff  --git a/llvm/test/CodeGen/Thumb2/mve-selectcc.ll 
b/llvm/test/CodeGen/Thumb2/mve-selectcc.ll
index 2633d2a3b2f5..b4f5d8d8fa3f 100644
--- a/llvm/test/CodeGen/Thumb2/mve-selectcc.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-selectcc.ll
@@ -203,3 +203,53 @@ entry:
   %s = select i1 %c,  <2 x double> %s0, <2 x double> %s1
   ret <2 x double> %s
 }
+
+define i32 @e() {
+; CHECK-LABEL: e:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:adr r0, .LCPI14_0
+; CHECK-NEXT:vmov.i32 q1, #0x4
+; CHECK-NEXT:vldrw.u32 q0, [r0]
+; CHECK-NEXT:movs r0, #0
+; CHECK-NEXT:vmov q2, q0
+; CHECK-NEXT:  .LBB14_1: @ %vector.body
+; CHECK-NEXT:@ =>This Inner Loop Header: Depth=1
+; CHECK-NEXT:adds r0, #4
+; CHECK-NEXT:vadd.i32 q2, q2, q1
+; CHECK-NEXT:cmp r0, #8
+; CHECK-NEXT:cset r1, eq
+; CHECK-NEXT:tst.w r1, #1
+; CHECK-NEXT:csetm r1, ne
+; CHECK-NEXT:subs.w r2, r0, #8
+; CHECK-NEXT:vdup.32 q3, r1
+; CHECK-NEXT:csel r0, r0, r2, ne
+; CHECK-NEXT:vbic q2, q2, q3
+; CHECK-NEXT:vand q3, q3, q0
+; CHECK-NEXT:vorr q2, q3, q2
+; CHECK-NEXT:b .LBB14_1
+; CHECK-NEXT:.p2align 4
+; CHECK-NEXT:  @ %bb.2:
+; CHECK-NEXT:  .LCPI14_0:
+; CHECK-NEXT:.long 0 @ 0x0
+; CHECK-NEXT:.long 1 @ 0x1
+; CHECK-NEXT:.long 2 @ 0x2
+; CHECK-NEXT:.long 3 @ 0x3
+entry:
+  br label %vector.body
+
+vector.body:  ; preds = 
%pred.store.continue73, %entry
+  %index = phi i32 [ 0, %entry ], [ %spec.select, %pred.store.continue73 ]
+  %vec.ind = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, %entry ], [ %spec.select74, %pred.store.continue73 ]
+  %l3 = icmp ult <4 x i32> %vec.ind, 
+  %l4 = extractelement <4 x i1> %l3, i32 0
+  br label %pred.store.continue73
+
+pred.store.continue73:; preds = %pred.store.if72, 
%pred.store.continue71
+  %index.next = add i32 %index, 4
+  %vec.ind.next = add <4 x i32> %vec.ind, <i32 4, i32 4, i32 4, i32 4>
+  %l60 = icmp eq i32 %index.next, 8
+  %spec.select = select i1 %l60, i32 0, i32 %index.next
+  %spec.select74 = select i1 %l60, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> %vec.ind.next
+  br label %vector.body
+}
+



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] 09d82fa - [AArch64] Update pass pipeline test. NFC

2020-12-01 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-01T10:40:04Z
New Revision: 09d82fa95f4561a6a2ce80bce00209018ba70c24

URL: 
https://github.com/llvm/llvm-project/commit/09d82fa95f4561a6a2ce80bce00209018ba70c24
DIFF: 
https://github.com/llvm/llvm-project/commit/09d82fa95f4561a6a2ce80bce00209018ba70c24.diff

LOG: [AArch64] Update pass pipeline test. NFC

Added: 


Modified: 
llvm/test/CodeGen/AArch64/O3-pipeline.ll

Removed: 




diff  --git a/llvm/test/CodeGen/AArch64/O3-pipeline.ll 
b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
index 364c58f4acdf..28753d646b85 100644
--- a/llvm/test/CodeGen/AArch64/O3-pipeline.ll
+++ b/llvm/test/CodeGen/AArch64/O3-pipeline.ll
@@ -9,9 +9,9 @@
 ; CHECK-NEXT: Machine Module Information
 ; CHECK-NEXT: Target Transform Information
 ; CHECK-NEXT: Assumption Cache Tracker
+; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Type-Based Alias Analysis
 ; CHECK-NEXT: Scoped NoAlias Alias Analysis
-; CHECK-NEXT: Profile summary info
 ; CHECK-NEXT: Create Garbage Collector Module Metadata
 ; CHECK-NEXT: Machine Branch Probability Analysis
 ; CHECK-NEXT:   ModulePass Manager



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] 7923d71 - [ARM] PREDICATE_CAST demanded bits

2020-12-01 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-12-01T10:32:24Z
New Revision: 7923d71b4a7a88f97c8a3efe1eb1473a4b2f5bf3

URL: 
https://github.com/llvm/llvm-project/commit/7923d71b4a7a88f97c8a3efe1eb1473a4b2f5bf3
DIFF: 
https://github.com/llvm/llvm-project/commit/7923d71b4a7a88f97c8a3efe1eb1473a4b2f5bf3.diff

LOG: [ARM] PREDICATE_CAST demanded bits

The PREDICATE_CAST node is used to model moves between MVE predicate
registers and gpr's, and eventually become a VMSR p0, rn. When moving to
a predicate only the bottom 16 bits of the sources register are
demanded. This adds a simple fold for that, allowing it to potentially
remove instructions like uxth.

Differential Revision: https://reviews.llvm.org/D92213
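
A minimal sketch of the effect (mirrored by the test updates below): because
VMSR reads only the low 16 bits of its source register, a zero-extend feeding
the predicate move is no longer demanded and folds away:

  uxth r0, r0
  vmsr p0, r0    -->    vmsr p0, r0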

Added: 


Modified: 
llvm/lib/Target/ARM/ARMISelLowering.cpp
llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp 
b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index c94b9e64632f..0426a560805a 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -13844,6 +13844,13 @@ PerformPREDICATE_CASTCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
 return DCI.DAG.getNode(ARMISD::PREDICATE_CAST, dl, VT, Op->getOperand(0));
   }
 
+  // Only the bottom 16 bits of the source register are used.
+  if (Op.getValueType() == MVT::i32) {
+APInt DemandedMask = APInt::getLowBitsSet(32, 16);
+const TargetLowering &TLI = DCI.DAG.getTargetLoweringInfo();
+if (TLI.SimplifyDemandedBits(Op, DemandedMask, DCI))
+  return SDValue(N, 0);
+  }
   return SDValue();
 }
 

diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
index fff9ad871027..c7e553fa3510 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-bitcast.ll
@@ -139,10 +139,9 @@ define arm_aapcs_vfpcc <16 x i8> @bitcast_to_v16i1(i16 %b, 
<16 x i8> %a) {
 ; CHECK-LE-NEXT:mov r4, sp
 ; CHECK-LE-NEXT:bfc r4, #0, #4
 ; CHECK-LE-NEXT:mov sp, r4
-; CHECK-LE-NEXT:uxth r0, r0
 ; CHECK-LE-NEXT:sub.w r4, r7, #8
-; CHECK-LE-NEXT:vmov.i32 q1, #0x0
 ; CHECK-LE-NEXT:vmsr p0, r0
+; CHECK-LE-NEXT:vmov.i32 q1, #0x0
 ; CHECK-LE-NEXT:vpsel q0, q0, q1
 ; CHECK-LE-NEXT:mov sp, r4
 ; CHECK-LE-NEXT:pop {r4, r6, r7, pc}
@@ -160,7 +159,6 @@ define arm_aapcs_vfpcc <16 x i8> @bitcast_to_v16i1(i16 %b, 
<16 x i8> %a) {
 ; CHECK-BE-NEXT:mov sp, r4
 ; CHECK-BE-NEXT:vrev64.8 q1, q0
 ; CHECK-BE-NEXT:vmov.i32 q0, #0x0
-; CHECK-BE-NEXT:uxth r0, r0
 ; CHECK-BE-NEXT:sub.w r4, r7, #8
 ; CHECK-BE-NEXT:vrev32.8 q0, q0
 ; CHECK-BE-NEXT:vmsr p0, r0

diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
index afad0077bbe7..17f57743c301 100644
--- a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
@@ -51,10 +51,8 @@ define arm_aapcs_vfpcc void @const(<8 x i16> %acc0, <8 x 
i16> %acc1, i32* nocapt
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:.save {r4, r6, r7, lr}
 ; CHECK-NEXT:push {r4, r6, r7, lr}
-; CHECK-NEXT:uxth r2, r1
+; CHECK-NEXT:vmsr p0, r1
 ; CHECK-NEXT:mvns r1, r1
-; CHECK-NEXT:vmsr p0, r2
-; CHECK-NEXT:uxth r1, r1
 ; CHECK-NEXT:vpstt
 ; CHECK-NEXT:vaddvt.s16 r12, q1
 ; CHECK-NEXT:vaddvt.s16 r2, q0
@@ -92,7 +90,6 @@ define arm_aapcs_vfpcc <4 x i32> @xorvpnot_i32(<4 x i32> 
%acc0, i16 signext %p0)
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:mvns r0, r0
 ; CHECK-NEXT:vmov.i32 q1, #0x0
-; CHECK-NEXT:uxth r0, r0
 ; CHECK-NEXT:vmsr p0, r0
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
@@ -109,7 +106,6 @@ define arm_aapcs_vfpcc <8 x i16> @xorvpnot_i16(<8 x i16> 
%acc0, i16 signext %p0)
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:mvns r0, r0
 ; CHECK-NEXT:vmov.i32 q1, #0x0
-; CHECK-NEXT:uxth r0, r0
 ; CHECK-NEXT:vmsr p0, r0
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr
@@ -126,7 +122,6 @@ define arm_aapcs_vfpcc <16 x i8> @xorvpnot_i8(<16 x i8> 
%acc0, i16 signext %p0)
 ; CHECK:   @ %bb.0: @ %entry
 ; CHECK-NEXT:mvns r0, r0
 ; CHECK-NEXT:vmov.i32 q1, #0x0
-; CHECK-NEXT:uxth r0, r0
 ; CHECK-NEXT:vmsr p0, r0
 ; CHECK-NEXT:vpsel q0, q0, q1
 ; CHECK-NEXT:bx lr



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] d5387c0 - [ARM] Constant predicate tests. NFC

2020-11-30 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-30T09:18:25Z
New Revision: d5387c044d96cda70701fcb7fb3ad06955957ed4

URL: 
https://github.com/llvm/llvm-project/commit/d5387c044d96cda70701fcb7fb3ad06955957ed4
DIFF: 
https://github.com/llvm/llvm-project/commit/d5387c044d96cda70701fcb7fb3ad06955957ed4.diff

LOG: [ARM] Constant predicate tests. NFC

Added: 
llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll

Modified: 


Removed: 




diff  --git a/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll 
b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
new file mode 100644
index ..afad0077bbe7
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/mve-pred-constfold.ll
@@ -0,0 +1,153 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve 
-verify-machineinstrs %s -o - | FileCheck %s
+
+define arm_aapcs_vfpcc void @reg(<8 x i16> %acc0, <8 x i16> %acc1, i32* 
nocapture %px, i16 signext %p0) {
+; CHECK-LABEL: reg:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r4, r6, r7, lr}
+; CHECK-NEXT:push {r4, r6, r7, lr}
+; CHECK-NEXT:.pad #8
+; CHECK-NEXT:sub sp, #8
+; CHECK-NEXT:movw r1, #52428
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:movw r1, #13107
+; CHECK-NEXT:vstr p0, [sp, #4] @ 4-byte Spill
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvt.s16 r12, q1
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vstr p0, [sp] @ 4-byte Spill
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvt.s16 r2, q1
+; CHECK-NEXT:vldr p0, [sp, #4] @ 4-byte Reload
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvt.s16 r4, q0
+; CHECK-NEXT:vldr p0, [sp] @ 4-byte Reload
+; CHECK-NEXT:vpst
+; CHECK-NEXT:vaddvt.s16 r6, q0
+; CHECK-NEXT:strd r6, r4, [r0]
+; CHECK-NEXT:strd r2, r12, [r0, #8]
+; CHECK-NEXT:add sp, #8
+; CHECK-NEXT:pop {r4, r6, r7, pc}
+entry:
+  %0 = tail call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 13107)
+  %1 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc0, 
i32 0, <8 x i1> %0)
+  %2 = tail call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 52428)
+  %3 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc0, 
i32 0, <8 x i1> %2)
+  %4 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc1, 
i32 0, <8 x i1> %0)
+  %5 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc1, 
i32 0, <8 x i1> %2)
+  store i32 %1, i32* %px, align 4
+  %arrayidx1 = getelementptr inbounds i32, i32* %px, i32 1
+  store i32 %3, i32* %arrayidx1, align 4
+  %arrayidx2 = getelementptr inbounds i32, i32* %px, i32 2
+  store i32 %4, i32* %arrayidx2, align 4
+  %arrayidx3 = getelementptr inbounds i32, i32* %px, i32 3
+  store i32 %5, i32* %arrayidx3, align 4
+  ret void
+}
+
+
+define arm_aapcs_vfpcc void @const(<8 x i16> %acc0, <8 x i16> %acc1, i32* 
nocapture %px, i16 signext %p0) {
+; CHECK-LABEL: const:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:.save {r4, r6, r7, lr}
+; CHECK-NEXT:push {r4, r6, r7, lr}
+; CHECK-NEXT:uxth r2, r1
+; CHECK-NEXT:mvns r1, r1
+; CHECK-NEXT:vmsr p0, r2
+; CHECK-NEXT:uxth r1, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvt.s16 r12, q1
+; CHECK-NEXT:vaddvt.s16 r2, q0
+; CHECK-NEXT:vmsr p0, r1
+; CHECK-NEXT:vpstt
+; CHECK-NEXT:vaddvt.s16 r4, q1
+; CHECK-NEXT:vaddvt.s16 r6, q0
+; CHECK-NEXT:stm.w r0, {r2, r6, r12}
+; CHECK-NEXT:str r4, [r0, #12]
+; CHECK-NEXT:pop {r4, r6, r7, pc}
+entry:
+  %0 = zext i16 %p0 to i32
+  %1 = tail call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 %0)
+  %2 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc0, 
i32 0, <8 x i1> %1)
+  %3 = xor i16 %p0, -1
+  %4 = zext i16 %3 to i32
+  %5 = tail call <8 x i1> @llvm.arm.mve.pred.i2v.v8i1(i32 %4)
+  %6 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc0, 
i32 0, <8 x i1> %5)
+  %7 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc1, 
i32 0, <8 x i1> %1)
+  %8 = tail call i32 @llvm.arm.mve.addv.predicated.v8i16.v8i1(<8 x i16> %acc1, 
i32 0, <8 x i1> %5)
+  store i32 %2, i32* %px, align 4
+  %arrayidx1 = getelementptr inbounds i32, i32* %px, i32 1
+  store i32 %6, i32* %arrayidx1, align 4
+  %arrayidx2 = getelementptr inbounds i32, i32* %px, i32 2
+  store i32 %7, i32* %arrayidx2, align 4
+  %arrayidx3 = getelementptr inbounds i32, i32* %px, i32 3
+  store i32 %8, i32* %arrayidx3, align 4
+  ret void
+}
+
+
+
+define arm_aapcs_vfpcc <4 x i32> @xorvpnot_i32(<4 x i32> %acc0, i16 signext 
%p0) {
+; CHECK-LABEL: xorvpnot_i32:
+; CHECK:   @ %bb.0: @ %entry
+; CHECK-NEXT:mvns r0, r0
+; CHECK-NEXT:vmov.i32 q1, #0x0
+; CHECK-NEXT:uxth r0, r0
+; CHECK-NEXT:vmsr p0, r0
+; CHECK-NEXT:vpsel q0, q0, q1
+; CHECK-NEXT:bx lr
+entry:
+  %l3 = xor i16 %p0, -1
+  %l4 = zext i16 %l3 to i32
+  %l5 = tail call <4 x i1> 

[llvm-branch-commits] [llvm] d939ba4 - [ARM] MVE qabs vectorization test. NFC

2020-11-27 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-27T12:21:11Z
New Revision: d939ba4c6853ed469a7fd198c751a158cc7e5c59

URL: 
https://github.com/llvm/llvm-project/commit/d939ba4c6853ed469a7fd198c751a158cc7e5c59
DIFF: 
https://github.com/llvm/llvm-project/commit/d939ba4c6853ed469a7fd198c751a158cc7e5c59.diff

LOG: [ARM] MVE qabs vectorization test. NFC

Added: 
llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll

Modified: 


Removed: 




diff  --git a/llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll 
b/llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll
new file mode 100644
index ..903b467c7581
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll
@@ -0,0 +1,292 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -loop-vectorize -instcombine -simplifycfg < %s -S -o - | FileCheck 
%s
+
+target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
+target triple = "thumbv8.1m.main-arm-none-eabi"
+
+define void @arm_abs_q7(i8* nocapture readonly %pSrc, i8* nocapture %pDst, i32 
%blockSize) #0 {
+; CHECK-LABEL: @arm_abs_q7(
+; CHECK-NEXT:  entry:
+; CHECK-NEXT:[[CMP_NOT19:%.*]] = icmp eq i32 [[BLOCKSIZE:%.*]], 0
+; CHECK-NEXT:br i1 [[CMP_NOT19]], label [[WHILE_END:%.*]], label 
[[WHILE_BODY_PREHEADER:%.*]]
+; CHECK:   while.body.preheader:
+; CHECK-NEXT:[[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[BLOCKSIZE]], 16
+; CHECK-NEXT:br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label 
[[VECTOR_MEMCHECK:%.*]]
+; CHECK:   vector.memcheck:
+; CHECK-NEXT:[[SCEVGEP:%.*]] = getelementptr i8, i8* [[PDST:%.*]], i32 
[[BLOCKSIZE]]
+; CHECK-NEXT:[[SCEVGEP1:%.*]] = getelementptr i8, i8* [[PSRC:%.*]], i32 
[[BLOCKSIZE]]
+; CHECK-NEXT:[[BOUND0:%.*]] = icmp ugt i8* [[SCEVGEP1]], [[PDST]]
+; CHECK-NEXT:[[BOUND1:%.*]] = icmp ugt i8* [[SCEVGEP]], [[PSRC]]
+; CHECK-NEXT:[[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
+; CHECK-NEXT:br i1 [[FOUND_CONFLICT]], label [[SCALAR_PH]], label 
[[VECTOR_PH:%.*]]
+; CHECK:   vector.ph:
+; CHECK-NEXT:[[N_VEC:%.*]] = and i32 [[BLOCKSIZE]], -16
+; CHECK-NEXT:[[IND_END:%.*]] = getelementptr i8, i8* [[PSRC]], i32 
[[N_VEC]]
+; CHECK-NEXT:[[IND_END3:%.*]] = and i32 [[BLOCKSIZE]], 15
+; CHECK-NEXT:[[IND_END5:%.*]] = getelementptr i8, i8* [[PDST]], i32 
[[N_VEC]]
+; CHECK-NEXT:br label [[VECTOR_BODY:%.*]]
+; CHECK:   vector.body:
+; CHECK-NEXT:[[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ 
[[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:[[NEXT_GEP:%.*]] = getelementptr i8, i8* [[PSRC]], i32 
[[INDEX]]
+; CHECK-NEXT:[[NEXT_GEP6:%.*]] = getelementptr i8, i8* [[PDST]], i32 
[[INDEX]]
+; CHECK-NEXT:[[TMP0:%.*]] = bitcast i8* [[NEXT_GEP]] to <16 x i8>*
+; CHECK-NEXT:[[WIDE_LOAD:%.*]] = load <16 x i8>, <16 x i8>* [[TMP0]], 
align 1, !alias.scope !0
+; CHECK-NEXT:[[TMP1:%.*]] = icmp sgt <16 x i8> [[WIDE_LOAD]], 
zeroinitializer
+; CHECK-NEXT:[[TMP2:%.*]] = icmp eq <16 x i8> [[WIDE_LOAD]], <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>
+; CHECK-NEXT:[[TMP3:%.*]] = sub <16 x i8> zeroinitializer, [[WIDE_LOAD]]
+; CHECK-NEXT:[[TMP4:%.*]] = select <16 x i1> [[TMP2]], <16 x i8> <i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127>, <16 x i8> [[TMP3]]
+; CHECK-NEXT:[[TMP5:%.*]] = select <16 x i1> [[TMP1]], <16 x i8> 
[[WIDE_LOAD]], <16 x i8> [[TMP4]]
+; CHECK-NEXT:[[TMP6:%.*]] = bitcast i8* [[NEXT_GEP6]] to <16 x i8>*
+; CHECK-NEXT:store <16 x i8> [[TMP5]], <16 x i8>* [[TMP6]], align 1, 
!alias.scope !3, !noalias !0
+; CHECK-NEXT:[[INDEX_NEXT]] = add i32 [[INDEX]], 16
+; CHECK-NEXT:[[TMP7:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label 
[[VECTOR_BODY]], [[LOOP5:!llvm.loop !.*]]
+; CHECK:   middle.block:
+; CHECK-NEXT:[[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[BLOCKSIZE]]
+; CHECK-NEXT:br i1 [[CMP_N]], label [[WHILE_END]], label [[SCALAR_PH]]
+; CHECK:   scalar.ph:
+; CHECK-NEXT:[[BC_RESUME_VAL:%.*]] = phi i8* [ [[IND_END]], 
[[MIDDLE_BLOCK]] ], [ [[PSRC]], [[WHILE_BODY_PREHEADER]] ], [ [[PSRC]], 
[[VECTOR_MEMCHECK]] ]
+; CHECK-NEXT:[[BC_RESUME_VAL2:%.*]] = phi i32 [ [[IND_END3]], 
[[MIDDLE_BLOCK]] ], [ [[BLOCKSIZE]], [[WHILE_BODY_PREHEADER]] ], [ 
[[BLOCKSIZE]], [[VECTOR_MEMCHECK]] ]
+; CHECK-NEXT:[[BC_RESUME_VAL4:%.*]] = phi i8* [ [[IND_END5]], 
[[MIDDLE_BLOCK]] ], [ [[PDST]], [[WHILE_BODY_PREHEADER]] ], [ [[PDST]], 
[[VECTOR_MEMCHECK]] ]
+; CHECK-NEXT:br label [[WHILE_BODY:%.*]]
+; CHECK:   while.body:
+; CHECK-NEXT:[[PSRC_ADDR_022:%.*]] = phi i8* [ [[INCDEC_PTR:%.*]], 
[[WHILE_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
+; CHECK-NEXT:[[BLKCNT_021:%.*]] = phi i32 [ [[DEC:%.*]], [[WHILE_BODY]] ], 
[ [[BC_RESUME_VAL2]], [[SCALAR_PH]] ]
+; CHECK-NEXT:[[PDST_ADDR_020:%.*]] = phi i8* [ [[INCDEC_PTR13:%.*]], 
[[WHILE_BODY]] ], [ [[BC_RESUME_VAL4]], [[SCALAR_PH]] ]
+; CHECK-NEXT:[[INCDEC_PTR]] = 

[llvm-branch-commits] [llvm] 0e49a40 - [ARM] Cleanup for the MVETailPrediction pass

2020-11-26 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-26T15:10:44Z
New Revision: 0e49a40d756b4487aebea436f8f84411c1a629e7

URL: 
https://github.com/llvm/llvm-project/commit/0e49a40d756b4487aebea436f8f84411c1a629e7
DIFF: 
https://github.com/llvm/llvm-project/commit/0e49a40d756b4487aebea436f8f84411c1a629e7.diff

LOG: [ARM] Cleanup for the MVETailPrediction pass

This strips out a lot of the code that should no longer be needed from
the MVETailPredictionPass, leaving the important part - find active lane
mask instructions and convert them to VCTP operations.

Differential Revision: https://reviews.llvm.org/D91866
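
What remains of the pass is essentially this rewrite, shown as an IR sketch
(names and element count are illustrative; %elts is the number of elements
still to be processed):

  %m = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %tc)
    ; becomes
  %m = call <4 x i1> @llvm.arm.mve.vctp32(i32 %elts)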

Added: 


Modified: 
llvm/lib/Target/ARM/MVETailPredication.cpp
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-intrinsic-round.ll
llvm/test/CodeGen/Thumb2/active_lane_mask.ll

Removed: 




diff  --git a/llvm/lib/Target/ARM/MVETailPredication.cpp 
b/llvm/lib/Target/ARM/MVETailPredication.cpp
index 25d5fd7e69c6..8055b5cf500d 100644
--- a/llvm/lib/Target/ARM/MVETailPredication.cpp
+++ b/llvm/lib/Target/ARM/MVETailPredication.cpp
@@ -22,23 +22,13 @@
 /// The HardwareLoops pass inserts intrinsics identifying loops that the
 /// backend will attempt to convert into a low-overhead loop. The vectorizer is
 /// responsible for generating a vectorized loop in which the lanes are
-/// predicated upon the iteration counter. This pass looks at these predicated
-/// vector loops, that are targets for low-overhead loops, and prepares it for
-/// code generation. Once the vectorizer has produced a masked loop, there's a
-/// couple of final forms:
-/// - A tail-predicated loop, with implicit predication.
-/// - A loop containing multiple VCPT instructions, predicating multiple VPT
-///   blocks of instructions operating on different vector types.
-///
-/// This pass:
-/// 1) Checks if the predicates of the masked load/store instructions are
-///generated by intrinsic @llvm.get.active.lanes(). This intrinsic consumes
-///the scalar loop tripcount as its second argument, which we extract
-///to set up the number of elements processed by the loop.
-/// 2) Intrinsic @llvm.get.active.lanes() is then replaced by the MVE target
-///specific VCTP intrinsic to represent the effect of tail predication.
-///This will be picked up by the ARM Low-overhead loop pass, which performs
-///the final transformation to a DLSTP or WLSTP tail-predicated loop.
+/// predicated upon a get.active.lane.mask intrinsic. This pass looks at these
+/// get.active.lane.mask intrinsics and attempts to convert them to VCTP
+/// instructions. This will be picked up by the ARM Low-overhead loop pass 
later
+/// in the backend, which performs the final transformation to a DLSTP or WLSTP
+/// tail-predicated loop.
+//
+//===--===//
 
 #include "ARM.h"
 #include "ARMSubtarget.h"
@@ -57,6 +47,7 @@
 #include "llvm/InitializePasses.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Local.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
 #include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
 
@@ -112,23 +103,18 @@ class MVETailPredication : public LoopPass {
   bool runOnLoop(Loop *L, LPPassManager&) override;
 
 private:
-  /// Perform the relevant checks on the loop and convert if possible.
-  bool TryConvert(Value *TripCount);
-
-  /// Return whether this is a vectorized loop, that contains masked
-  /// load/stores.
-  bool IsPredicatedVectorLoop();
+  /// Perform the relevant checks on the loop and convert active lane masks if
+  /// possible.
+  bool TryConvertActiveLaneMask(Value *TripCount);
 
   /// Perform several checks on the arguments of @llvm.get.active.lane.mask
   /// intrinsic. E.g., check that the loop induction variable and the element
   /// count are of the form we expect, and also perform overflow checks for
   /// the new expressions that are created.
-  bool IsSafeActiveMask(IntrinsicInst *ActiveLaneMask, Value *TripCount,
-FixedVectorType *VecTy);
+  bool IsSafeActiveMask(IntrinsicInst *ActiveLaneMask, Value *TripCount);
 
   /// Insert the intrinsic to represent the effect of tail predication.
-  void InsertVCTPIntrinsic(IntrinsicInst *ActiveLaneMask, Value *TripCount,
-   FixedVectorType *VecTy);
+  void InsertVCTPIntrinsic(IntrinsicInst *ActiveLaneMask, Value *TripCount);
 
   /// Rematerialize the iteration count in exit blocks, which enables
   /// ARMLowOverheadLoops to better optimise away loop update statements inside
@@ -138,25 +124,6 @@ class MVETailPredication : public LoopPass {
 
 } // end namespace
 
-static bool IsDecrement(Instruction &I) {
-  auto *Call = dyn_cast<IntrinsicInst>(&I);
-  if (!Call)
-return false;
-
-  Intrinsic::ID ID = Call->getIntrinsicID();
-  return ID == Intrinsic::loop_decrement_reg;
-}
-
-static bool 

[llvm-branch-commits] [llvm] e0c479c - [VPlan] Switch VPWidenRecipe to be a VPValue

2020-11-25 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-25T08:25:06Z
New Revision: e0c479cd0e03279784925ece209ff53bdbb86cf8

URL: 
https://github.com/llvm/llvm-project/commit/e0c479cd0e03279784925ece209ff53bdbb86cf8
DIFF: 
https://github.com/llvm/llvm-project/commit/e0c479cd0e03279784925ece209ff53bdbb86cf8.diff

LOG: [VPlan] Switch VPWidenRecipe to be a VPValue

Similar to other patches, this makes VPWidenRecipe a VPValue. Because of
the way it interacts with the reduction code it also slightly alters the
way that VPValues are registered, removing the up front NeedDef and
using getOrAddVPValue to create them on-demand if needed instead.

Differential Revision: https://reviews.llvm.org/D88447
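
A small C++ sketch of what this unlocks (illustrative only, not code from the
patch; WidenR is a hypothetical recipe pointer): once the recipe defines a
VPValue, later transformations can walk def-use chains over recipes directly:

  VPValue *Def = WidenR;            // a VPWidenRecipe now is-a VPValue
  for (VPUser *U : Def->users())
    ;                               // visit the recipes consuming it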

Added: 


Modified: 
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/lib/Transforms/Vectorize/VPlan.cpp
llvm/lib/Transforms/Vectorize/VPlan.h
llvm/lib/Transforms/Vectorize/VPlanValue.h
llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll
llvm/test/Transforms/LoopVectorize/vplan-printing.ll

Removed: 




diff  --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h 
b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index b3b744947c1f..ec88bebe684d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -292,8 +292,7 @@ class LoopVectorizationPlanner {
   /// Build a VPlan using VPRecipes according to the information gather by
   /// Legal. This method is only used for the legacy inner loop vectorizer.
   VPlanPtr buildVPlanWithVPRecipes(
-  VFRange &Range, SmallPtrSetImpl<Value *> &NeedDef,
-  SmallPtrSetImpl<Instruction *> &DeadInstructions,
+  VFRange &Range, SmallPtrSetImpl<Instruction *> &DeadInstructions,
   const DenseMap<Instruction *, Instruction *> &SinkAfter);
 
   /// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive,

diff  --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp 
b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 35af7a445eef..97c9011d8086 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -461,7 +461,7 @@ class InnerLoopVectorizer {
   BasicBlock *createVectorizedLoopSkeleton();
 
   /// Widen a single instruction within the innermost loop.
-  void widenInstruction(Instruction &I, VPUser &User,
+  void widenInstruction(Instruction &I, VPValue *Def, VPUser &User,
                         VPTransformState &State);
 
   /// Widen a single call instruction within the innermost loop.
@@ -4512,7 +4512,8 @@ static bool mayDivideByZero(Instruction ) {
   return !CInt || CInt->isZero();
 }
 
-void InnerLoopVectorizer::widenInstruction(Instruction &I, VPUser &User,
+void InnerLoopVectorizer::widenInstruction(Instruction &I, VPValue *Def,
+                                           VPUser &User,
                                            VPTransformState &State) {
   assert(!VF.isScalable() && "scalable vectors not yet supported.");
   switch (I.getOpcode()) {
@@ -4555,7 +4556,7 @@ void InnerLoopVectorizer::widenInstruction(Instruction 
, VPUser ,
 VecOp->copyIRFlags();
 
   // Use this vector value for all users of the original instruction.
-  VectorLoopValueMap.setVectorValue(&I, Part, V);
+  State.set(Def, &I, V, Part);
   addMetadata(V, &I);
 }
 
@@ -4579,7 +4580,7 @@ void InnerLoopVectorizer::widenInstruction(Instruction 
, VPUser ,
   } else {
 C = Builder.CreateICmp(Cmp->getPredicate(), A, B);
   }
-  VectorLoopValueMap.setVectorValue(&I, Part, C);
+  State.set(Def, &I, C, Part);
   addMetadata(C, &I);
 }
 
@@ -4609,7 +4610,7 @@ void InnerLoopVectorizer::widenInstruction(Instruction 
, VPUser ,
 for (unsigned Part = 0; Part < UF; ++Part) {
   Value *A = State.get(User.getOperand(0), Part);
   Value *Cast = Builder.CreateCast(CI->getOpcode(), A, DestTy);
-  VectorLoopValueMap.setVectorValue(&I, Part, Cast);
+  State.set(Def, &I, Cast, Part);
   addMetadata(Cast, &I);
 }
 break;
@@ -7262,7 +7263,7 @@ VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst,
   if (!BI->isConditional() || BI->getSuccessor(0) == BI->getSuccessor(1))
 return EdgeMaskCache[Edge] = SrcMask;
 
-  VPValue *EdgeMask = Plan->getVPValue(BI->getCondition());
+  VPValue *EdgeMask = Plan->getOrAddVPValue(BI->getCondition());
   assert(EdgeMask && "No Edge Mask found for condition");
 
   if (BI->getSuccessor(0) != Dst)
@@ -7300,7 +7301,7 @@ VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
 // Start by constructing the desired canonical IV.
 VPValue *IV = nullptr;
 if (Legal->getPrimaryInduction())
-  IV = Plan->getVPValue(Legal->getPrimaryInduction());
+  IV = Plan->getOrAddVPValue(Legal->getPrimaryInduction());
 else {
   auto IVRecipe = new VPWidenCanonicalIVRecipe();
   Builder.getInsertBlock()->insert(IVRecipe, NewInsertionPoint);
@@ -7648,24 +7649,6 @@ void 

[llvm-branch-commits] [llvm] 00a6601 - [VPlan] Turn VPReductionRecipe into a VPValue

2020-11-25 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-25T08:25:05Z
New Revision: 00a66011366c7b037d6680e6015524a41b761c34

URL: 
https://github.com/llvm/llvm-project/commit/00a66011366c7b037d6680e6015524a41b761c34
DIFF: 
https://github.com/llvm/llvm-project/commit/00a66011366c7b037d6680e6015524a41b761c34.diff

LOG: [VPlan] Turn VPReductionRecipe into a VPValue

This converts the VPReductionRecipe into a VPValue, like other
VPRecipe's in preparation for traversing def-use chains. It also makes
it a VPUser, now storing the used VPValues as operands.

It doesn't yet change how the VPReductionRecipes are created. It will
need to call replaceAllUsesWith from the original recipe they replace,
but that is not done yet, as VPWidenRecipes need to be created first.

Differential Revision: https://reviews.llvm.org/D88382
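
With the recipe now defining a VPValue it prints as a definition; going by the
print() change below, the debug output takes a form along these lines (operand
names are illustrative):

  REDUCE ir<%red.next> = ir<%red> + reduce.fadd (ir<%val>, ir<%mask>)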

Added: 


Modified: 
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
llvm/lib/Transforms/Vectorize/VPlan.cpp
llvm/lib/Transforms/Vectorize/VPlan.h
llvm/lib/Transforms/Vectorize/VPlanValue.h
llvm/test/Transforms/LoopVectorize/vplan-printing.ll

Removed: 




diff  --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp 
b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index e29a0a8bd666..35af7a445eef 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8093,9 +8093,9 @@ void VPReductionRecipe::execute(VPTransformState &State) {
   assert(!State.Instance && "Reduction being replicated.");
   for (unsigned Part = 0; Part < State.UF; ++Part) {
 RecurrenceDescriptor::RecurrenceKind Kind = RdxDesc->getRecurrenceKind();
-Value *NewVecOp = State.get(VecOp, Part);
-if (CondOp) {
-  Value *NewCond = State.get(CondOp, Part);
+Value *NewVecOp = State.get(getVecOp(), Part);
+if (VPValue *Cond = getCondOp()) {
+  Value *NewCond = State.get(Cond, Part);
   VectorType *VecTy = cast<VectorType>(NewVecOp->getType());
   Constant *Iden = RecurrenceDescriptor::getRecurrenceIdentity(
   Kind, RdxDesc->getMinMaxRecurrenceKind(), VecTy->getElementType());
@@ -8106,7 +8106,7 @@ void VPReductionRecipe::execute(VPTransformState ) {
 }
 Value *NewRed =
 createTargetReduction(State.Builder, TTI, *RdxDesc, NewVecOp, NoNaN);
-Value *PrevInChain = State.get(ChainOp, Part);
+Value *PrevInChain = State.get(getChainOp(), Part);
 Value *NextInChain;
 if (Kind == RecurrenceDescriptor::RK_IntegerMinMax ||
 Kind == RecurrenceDescriptor::RK_FloatMinMax) {
@@ -8115,9 +8115,10 @@ void VPReductionRecipe::execute(VPTransformState ) 
{
  NewRed, PrevInChain);
 } else {
   NextInChain = State.Builder.CreateBinOp(
-  (Instruction::BinaryOps)I->getOpcode(), NewRed, PrevInChain);
+  (Instruction::BinaryOps)getUnderlyingInstr()->getOpcode(), NewRed,
+  PrevInChain);
 }
-State.ValueMap.setVectorValue(I, Part, NextInChain);
+State.set(this, getUnderlyingInstr(), NextInChain, Part);
   }
 }
 

diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f7df8a4fb0e6..08ebafbf12cc 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -22,6 +22,7 @@
 #include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Twine.h"
+#include "llvm/Analysis/IVDescriptors.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/IR/BasicBlock.h"
 #include "llvm/IR/CFG.h"
@@ -110,12 +111,16 @@ VPUser *VPRecipeBase::toVPUser() {
 return U;
   if (auto *U = dyn_cast(this))
 return U;
+  if (auto *U = dyn_cast<VPReductionRecipe>(this))
+    return U;
   return nullptr;
 }
 
 VPValue *VPRecipeBase::toVPValue() {
   if (auto *V = dyn_cast(this))
 return V;
+  if (auto *V = dyn_cast<VPReductionRecipe>(this))
+    return V;
   if (auto *V = dyn_cast(this))
 return V;
   if (auto *V = dyn_cast(this))
@@ -130,6 +135,8 @@ VPValue *VPRecipeBase::toVPValue() {
 const VPValue *VPRecipeBase::toVPValue() const {
   if (auto *V = dyn_cast(this))
 return V;
+  if (auto *V = dyn_cast<VPReductionRecipe>(this))
+    return V;
   if (auto *V = dyn_cast(this))
 return V;
   if (auto *V = dyn_cast(this))
@@ -932,13 +939,16 @@ void VPBlendRecipe::print(raw_ostream &O, const Twine &Indent,
 
void VPReductionRecipe::print(raw_ostream &O, const Twine &Indent,
                              VPSlotTracker &SlotTracker) const {
-  O << "\"REDUCE of" << *I << " as ";
-  ChainOp->printAsOperand(O, SlotTracker);
-  O << " + reduce(";
-  VecOp->printAsOperand(O, SlotTracker);
-  if (CondOp) {
+  O << "\"REDUCE ";
+  printAsOperand(O, SlotTracker);
+  O << " = ";
+  getChainOp()->printAsOperand(O, SlotTracker);
+  O << " + reduce." << Instruction::getOpcodeName(RdxDesc->getRecurrenceBinOp())
+    << " (";
+  getVecOp()->printAsOperand(O, SlotTracker);
+  if (getCondOp()) {
 O << ", ";
-CondOp->printAsOperand(O, SlotTracker);
+    getCondOp()->printAsOperand(O, SlotTracker);

[llvm-branch-commits] [llvm] c8c3a41 - [ARM] Ensure MVE_TwoOpPattern is used inside Predicates

2020-11-22 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-22T21:38:00Z
New Revision: c8c3a411c50f541ce5362bd60ee3f8fe43ac2722

URL: https://github.com/llvm/llvm-project/commit/c8c3a411c50f541ce5362bd60ee3f8fe43ac2722
DIFF: https://github.com/llvm/llvm-project/commit/c8c3a411c50f541ce5362bd60ee3f8fe43ac2722.diff

LOG: [ARM] Ensure MVE_TwoOpPattern is used inside Predicates
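
The fix matters because a defm of MVE_TwoOpPattern emitted outside a
"let Predicates = [HasMVEInt]" block produces selection patterns that
are not guarded by the MVE feature. As a loose C++ analogy of
predicate-guarded patterns (hypothetical names, not LLVM's actual
instruction-selection code):

#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Subtarget {
  bool HasMVEInt = false; // feature bit, analogous to the HasMVEInt predicate
};

struct Pattern {
  std::string Name;
  std::function<bool(const Subtarget &)> Predicate;
};

// Only predicate-guarded patterns may fire; a pattern registered without a
// guard (or outside it) would wrongly be a candidate on every subtarget.
std::vector<std::string> candidates(const std::vector<Pattern> &Pats,
                                    const Subtarget &ST) {
  std::vector<std::string> Out;
  for (const Pattern &P : Pats)
    if (P.Predicate(ST))
      Out.push_back(P.Name);
  return Out;
}

int main() {
  std::vector<Pattern> Pats = {
      {"MVE_TwoOpPattern",
       [](const Subtarget &ST) { return ST.HasMVEInt; }}};
  Subtarget NoMVE; // MVE disabled: the guarded pattern must not be selected
  std::printf("candidates without MVE: %zu\n",
              candidates(Pats, NoMVE).size());
  return 0;
}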

Added: 


Modified: 
llvm/lib/Target/ARM/ARMInstrMVE.td

Removed: 




diff --git a/llvm/lib/Target/ARM/ARMInstrMVE.td b/llvm/lib/Target/ARM/ARMInstrMVE.td
index 66a6d4bd6de0..0f197d57a1f7 100644
--- a/llvm/lib/Target/ARM/ARMInstrMVE.td
+++ b/llvm/lib/Target/ARM/ARMInstrMVE.td
@@ -1962,9 +1962,10 @@ multiclass MVE_VQxDMULH_m {
   def "" : MVE_VQxDMULH_Base;
  defvar Inst = !cast<Instruction>(NAME);
-  defm : MVE_TwoOpPattern<VTI, Op, Inst>;
 
   let Predicates = [HasMVEInt] in {
+    defm : MVE_TwoOpPattern<VTI, Op, Inst>;
+
 // Extra unpredicated multiply intrinsic patterns
 def : Pat<(VTI.Vec (unpred_int (VTI.Vec MQPR:$Qm), (VTI.Vec MQPR:$Qn))),
   (VTI.Vec (Inst (VTI.Vec MQPR:$Qm), (VTI.Vec MQPR:$Qn)))>;
@@ -5492,7 +5493,10 @@ class MVE_VxxMUL_qr {
   def "" : MVE_VxxMUL_qr;
-  defm : MVE_TwoOpPatternDup<VTI, Op, !cast<Instruction>(NAME)>;
+
+  let Predicates = [HasMVEInt] in {
+    defm : MVE_TwoOpPatternDup<VTI, Op, !cast<Instruction>(NAME)>;
+  }
  defm : MVE_vec_scalar_int_pat_m<!cast<Instruction>(NAME), VTI, int_unpred, int_pred>;
 }
 



___
llvm-branch-commits mailing list
llvm-branch-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits


[llvm-branch-commits] [llvm] f3339b9 - [ARM] MVE VABD tests. NFC

2020-11-22 Thread David Green via llvm-branch-commits

Author: David Green
Date: 2020-11-22T21:16:49Z
New Revision: f3339b9f988cb86e32179982266cccf8962f7e45

URL: https://github.com/llvm/llvm-project/commit/f3339b9f988cb86e32179982266cccf8962f7e45
DIFF: https://github.com/llvm/llvm-project/commit/f3339b9f988cb86e32179982266cccf8962f7e45.diff

LOG: [ARM] MVE VABD tests. NFC
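
The vabd ("vector absolute difference") operation these tests target
computes |a - b| per lane, widening first so the subtraction cannot
overflow, which matches the sext/sub/abs IR pattern in the tests below.
A scalar C++ model of one s8 lane (illustrative only; the name is made
up, not an ARM intrinsic):

#include <cstdint>
#include <cstdio>

// One lane of vabd.s8: widen to 16 bits, subtract, take the absolute value.
static uint8_t vabd_s8_lane(int8_t a, int8_t b) {
  int16_t d = static_cast<int16_t>(a) - static_cast<int16_t>(b);
  return static_cast<uint8_t>(d < 0 ? -d : d);
}

int main() {
  // Widening avoids overflow: |-128 - 127| = 255 still fits after narrowing.
  std::printf("%d\n", static_cast<int>(vabd_s8_lane(-128, 127)));
  return 0;
}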

Added: 
llvm/test/CodeGen/Thumb2/mve-vabdus.ll

Modified: 


Removed: 




diff --git a/llvm/test/CodeGen/Thumb2/mve-vabdus.ll b/llvm/test/CodeGen/Thumb2/mve-vabdus.ll
new file mode 100644
index ..cb82f9020d34
--- /dev/null
+++ b/llvm/test/CodeGen/Thumb2/mve-vabdus.ll
@@ -0,0 +1,942 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve %s -o - | FileCheck %s
+
+define arm_aapcs_vfpcc <16 x i8> @vabd_s8(<16 x i8> %src1, <16 x i8> %src2) {
+; CHECK-LABEL: vabd_s8:
+; CHECK:   @ %bb.0:
+; CHECK-NEXT:vmov.u8 r0, q1[0]
+; CHECK-NEXT:vmov.16 q2[0], r0
+; CHECK-NEXT:vmov.u8 r0, q1[1]
+; CHECK-NEXT:vmov.16 q2[1], r0
+; CHECK-NEXT:vmov.u8 r0, q1[2]
+; CHECK-NEXT:vmov.16 q2[2], r0
+; CHECK-NEXT:vmov.u8 r0, q1[3]
+; CHECK-NEXT:vmov.16 q2[3], r0
+; CHECK-NEXT:vmov.u8 r0, q1[4]
+; CHECK-NEXT:vmov.16 q2[4], r0
+; CHECK-NEXT:vmov.u8 r0, q1[5]
+; CHECK-NEXT:vmov.16 q2[5], r0
+; CHECK-NEXT:vmov.u8 r0, q1[6]
+; CHECK-NEXT:vmov.16 q2[6], r0
+; CHECK-NEXT:vmov.u8 r0, q1[7]
+; CHECK-NEXT:vmov.16 q2[7], r0
+; CHECK-NEXT:vmov.u8 r0, q0[0]
+; CHECK-NEXT:vmov.16 q3[0], r0
+; CHECK-NEXT:vmov.u8 r0, q0[1]
+; CHECK-NEXT:vmov.16 q3[1], r0
+; CHECK-NEXT:vmov.u8 r0, q0[2]
+; CHECK-NEXT:vmov.16 q3[2], r0
+; CHECK-NEXT:vmov.u8 r0, q0[3]
+; CHECK-NEXT:vmov.16 q3[3], r0
+; CHECK-NEXT:vmov.u8 r0, q0[4]
+; CHECK-NEXT:vmov.16 q3[4], r0
+; CHECK-NEXT:vmov.u8 r0, q0[5]
+; CHECK-NEXT:vmov.16 q3[5], r0
+; CHECK-NEXT:vmov.u8 r0, q0[6]
+; CHECK-NEXT:vmov.16 q3[6], r0
+; CHECK-NEXT:vmov.u8 r0, q0[7]
+; CHECK-NEXT:vmov.16 q3[7], r0
+; CHECK-NEXT:vmovlb.s8 q2, q2
+; CHECK-NEXT:vmovlb.s8 q3, q3
+; CHECK-NEXT:vsub.i16 q2, q3, q2
+; CHECK-NEXT:vabs.s16 q3, q2
+; CHECK-NEXT:vmov.u16 r0, q3[0]
+; CHECK-NEXT:vmov.8 q2[0], r0
+; CHECK-NEXT:vmov.u16 r0, q3[1]
+; CHECK-NEXT:vmov.8 q2[1], r0
+; CHECK-NEXT:vmov.u16 r0, q3[2]
+; CHECK-NEXT:vmov.8 q2[2], r0
+; CHECK-NEXT:vmov.u16 r0, q3[3]
+; CHECK-NEXT:vmov.8 q2[3], r0
+; CHECK-NEXT:vmov.u16 r0, q3[4]
+; CHECK-NEXT:vmov.8 q2[4], r0
+; CHECK-NEXT:vmov.u16 r0, q3[5]
+; CHECK-NEXT:vmov.8 q2[5], r0
+; CHECK-NEXT:vmov.u16 r0, q3[6]
+; CHECK-NEXT:vmov.8 q2[6], r0
+; CHECK-NEXT:vmov.u16 r0, q3[7]
+; CHECK-NEXT:vmov.8 q2[7], r0
+; CHECK-NEXT:vmov.u8 r0, q1[8]
+; CHECK-NEXT:vmov.16 q3[0], r0
+; CHECK-NEXT:vmov.u8 r0, q1[9]
+; CHECK-NEXT:vmov.16 q3[1], r0
+; CHECK-NEXT:vmov.u8 r0, q1[10]
+; CHECK-NEXT:vmov.16 q3[2], r0
+; CHECK-NEXT:vmov.u8 r0, q1[11]
+; CHECK-NEXT:vmov.16 q3[3], r0
+; CHECK-NEXT:vmov.u8 r0, q1[12]
+; CHECK-NEXT:vmov.16 q3[4], r0
+; CHECK-NEXT:vmov.u8 r0, q1[13]
+; CHECK-NEXT:vmov.16 q3[5], r0
+; CHECK-NEXT:vmov.u8 r0, q1[14]
+; CHECK-NEXT:vmov.16 q3[6], r0
+; CHECK-NEXT:vmov.u8 r0, q1[15]
+; CHECK-NEXT:vmov.16 q3[7], r0
+; CHECK-NEXT:vmov.u8 r0, q0[8]
+; CHECK-NEXT:vmovlb.s8 q1, q3
+; CHECK-NEXT:vmov.16 q3[0], r0
+; CHECK-NEXT:vmov.u8 r0, q0[9]
+; CHECK-NEXT:vmov.16 q3[1], r0
+; CHECK-NEXT:vmov.u8 r0, q0[10]
+; CHECK-NEXT:vmov.16 q3[2], r0
+; CHECK-NEXT:vmov.u8 r0, q0[11]
+; CHECK-NEXT:vmov.16 q3[3], r0
+; CHECK-NEXT:vmov.u8 r0, q0[12]
+; CHECK-NEXT:vmov.16 q3[4], r0
+; CHECK-NEXT:vmov.u8 r0, q0[13]
+; CHECK-NEXT:vmov.16 q3[5], r0
+; CHECK-NEXT:vmov.u8 r0, q0[14]
+; CHECK-NEXT:vmov.16 q3[6], r0
+; CHECK-NEXT:vmov.u8 r0, q0[15]
+; CHECK-NEXT:vmov.16 q3[7], r0
+; CHECK-NEXT:vmovlb.s8 q0, q3
+; CHECK-NEXT:vsub.i16 q0, q0, q1
+; CHECK-NEXT:vabs.s16 q0, q0
+; CHECK-NEXT:vmov.u16 r0, q0[0]
+; CHECK-NEXT:vmov.8 q2[8], r0
+; CHECK-NEXT:vmov.u16 r0, q0[1]
+; CHECK-NEXT:vmov.8 q2[9], r0
+; CHECK-NEXT:vmov.u16 r0, q0[2]
+; CHECK-NEXT:vmov.8 q2[10], r0
+; CHECK-NEXT:vmov.u16 r0, q0[3]
+; CHECK-NEXT:vmov.8 q2[11], r0
+; CHECK-NEXT:vmov.u16 r0, q0[4]
+; CHECK-NEXT:vmov.8 q2[12], r0
+; CHECK-NEXT:vmov.u16 r0, q0[5]
+; CHECK-NEXT:vmov.8 q2[13], r0
+; CHECK-NEXT:vmov.u16 r0, q0[6]
+; CHECK-NEXT:vmov.8 q2[14], r0
+; CHECK-NEXT:vmov.u16 r0, q0[7]
+; CHECK-NEXT:vmov.8 q2[15], r0
+; CHECK-NEXT:vmov q0, q2
+; CHECK-NEXT:bx lr
+  %sextsrc1 = sext <16 x i8> %src1 to <16 x i16>
+  %sextsrc2 = sext <16 x i8> %src2 to <16 x i16>
+  %add1 = sub <16 x i16> %sextsrc1, %sextsrc2
+  %add2 = sub <16 x