[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-07-23 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

### Merge activity

* **Jul 23, 4:02 AM EDT**: @cdevadas started a stack merge that includes this 
pull request via 
[Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96163).


https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-23 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

### Merge activity

* **Jul 23, 4:02 AM EDT**: @cdevadas started a stack merge that includes this 
pull request via 
[Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96162).


https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-07-22 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

The latest patch further optimizes the PatFrag and the patterns by using OtherPredicates. The lit test changes in the latest patch are due to a missed optimization I had incorrectly introduced earlier in this PR for GFX7; it is now fixed and matches the default behavior of the current compiler.

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-17 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

> I still think it is terribly surprising all of the test diff shows up in this 
> commit, and not the selection case

Because the selection support is done in the next PR of the review stack, 
https://github.com/llvm/llvm-project/pull/96163. This patch takes care of 
choosing the right opcode while merging the loads.

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-10 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas edited 
https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-10 Thread Christudasan Devadasan via llvm-branch-commits


@@ -658,17 +658,17 @@ define amdgpu_kernel void 
@image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr,
 ;
 ; GFX1013-LABEL: image_bvh_intersect_ray_nsa_reassign:
 ; GFX1013:   ; %bb.0:
-; GFX1013-NEXT:s_load_dwordx8 s[0:7], s[0:1], 0x24
+; GFX1013-NEXT:s_load_dwordx8 s[4:11], s[0:1], 0x24

cdevadas wrote:

> I guess this code changes because xnack is enabled by default for GFX10.1?

Yes.

> Is there anything we could do to add known alignment info here, to avoid the
> code pessimization?

I'm not sure what can be done for it.



https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-07-10 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

Ping

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-10 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

Ping

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-04 Thread Christudasan Devadasan via llvm-branch-commits


@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr 
addrspace(3) %ptr, <2 x half>
 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) 
%ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:   ; %bb.0:
-; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Opened https://github.com/llvm/llvm-project/issues/97715.

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-03 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas edited 
https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-03 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas edited 
https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-03 Thread Christudasan Devadasan via llvm-branch-commits


@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr 
addrspace(3) %ptr, <2 x half>
 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) 
%ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:   ; %bb.0:
-; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Unfortunately, that's not happening. The IR load-store vectorizer doesn't 
combine the two loads.
I still see two loads after the IR vectorizer, and they become two loads in 
the selected code. Could this be because the alignments of the two loads 
differ and the IR vectorizer safely ignores them?

*** IR Dump before Selection ***
define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) 
%ptr, <2 x i16> %data) #0 {
  %local_atomic_fadd_v2bf16_noret.kernarg.segment = call nonnull align 16 
dereferenceable(44) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %ptr.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) 
%local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 36, !amdgpu.uniform !0
  **%ptr.load = load ptr addrspace(3), ptr addrspace(4) %ptr.kernarg.offset**, 
align 4, !invariant.load !0
  %data.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) 
%local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 40, !amdgpu.uniform !0
  **%data.load = load <2 x i16>, ptr addrspace(4) %data.kernarg.offset**, align 
8, !invariant.load !0
  %ret = call <2 x i16> @llvm.amdgcn.ds.fadd.v2bf16(ptr addrspace(3) %ptr.load, 
<2 x i16> %data.load)
  ret void
}
# *** IR Dump After selection ***:
# Machine code for function local_atomic_fadd_v2bf16_noret: IsSSA, 
TracksLiveness
Function Live Ins: $sgpr0_sgpr1 in %1

bb.0 (%ir-block.0):
  liveins: $sgpr0_sgpr1
  %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
  %3:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 36, 0 :: 
(dereferenceable invariant load (s32) from %ir.ptr.kernarg.offset, addrspace 4)
  %4:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 40, 0 :: 
(dereferenceable invariant load (s32) from %ir.data.kernarg.offset, align 8, 
addrspace 4)
  %5:vgpr_32 = COPY %3:sreg_32_xm0_xexec
  %6:vgpr_32 = COPY %4:sreg_32_xm0_xexec
  DS_PK_ADD_BF16 killed %5:vgpr_32, killed %6:vgpr_32, 0, 0, implicit $m0, 
implicit $exec
  S_ENDPGM 0


https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-03 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas edited 
https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-07-03 Thread Christudasan Devadasan via llvm-branch-commits


@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr 
addrspace(3) %ptr, <2 x half>
 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) 
%ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:   ; %bb.0:
-; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Earlier I wrongly used the dword size (Width) in the alignment check here, as 
Jay pointed out. I have now fixed it to use the byte size when comparing 
against the existing alignment of the first load.
https://github.com/llvm/llvm-project/pull/96162/commits/e7e6cbc4abd476a038fd7836e5078565e73d1fe9#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1730
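A minimal standalone sketch of that comparison (the function name and parameters are assumptions for illustration, not the actual patch): the merged width is counted in dwords, so it has to be scaled to bytes before comparing against the known alignment of the first load's MMO.

```cpp
#include <cstdint>

// Sketch only: returns true when the merged load would be under-aligned
// for its new size, i.e. when the constrained form may be needed.
bool mergedLoadIsUnderAligned(unsigned MergedWidthInDwords,
                              uint64_t FirstLoadAlignInBytes) {
  const uint64_t ByteSize = MergedWidthInDwords * 4ull; // dwords -> bytes
  return FirstLoadAlignInBytes < ByteSize;              // compare in bytes
}
```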

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for 64 bit divergent values (PR #96934)

2024-06-27 Thread Christudasan Devadasan via llvm-branch-commits


@@ -178,6 +178,20 @@ bool AMDGPUAtomicOptimizerImpl::run(Function ) {
   return Changed;
 }
 
+static bool shouldOptimizeForType(Type *Ty) {
+  switch (Ty->getTypeID()) {
+  case Type::FloatTyID:
+  case Type::DoubleTyID:
+return true;
+  case Type::IntegerTyID: {
+if (Ty->getIntegerBitWidth() == 32 || Ty->getIntegerBitWidth() == 64)

cdevadas wrote:

Get Ty->getIntegerBitWidth() just once outside?
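
For illustration, a sketch of what that suggestion could look like (not the committed code):

```cpp
#include "llvm/IR/Type.h"
using namespace llvm;

// Illustrative sketch of the suggested change: query the integer bit width
// once and reuse it in the comparison.
static bool shouldOptimizeForType(Type *Ty) {
  switch (Ty->getTypeID()) {
  case Type::FloatTyID:
  case Type::DoubleTyID:
    return true;
  case Type::IntegerTyID: {
    unsigned BitWidth = Ty->getIntegerBitWidth();
    return BitWidth == 32 || BitWidth == 64;
  }
  default:
    return false;
  }
}
```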

https://github.com/llvm/llvm-project/pull/96934


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-06-24 Thread Christudasan Devadasan via llvm-branch-commits


@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const 
CombineInfo ,
   return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM;
 }
   case S_LOAD_IMM:
-switch (Width) {
-default:
-  return 0;
-case 2:
-  return AMDGPU::S_LOAD_DWORDX2_IMM;
-case 3:
-  return AMDGPU::S_LOAD_DWORDX3_IMM;
-case 4:
-  return AMDGPU::S_LOAD_DWORDX4_IMM;
-case 8:
-  return AMDGPU::S_LOAD_DWORDX8_IMM;
+// For targets that support XNACK replay, use the constrained load opcode.
+if (STI && STI->hasXnackReplay()) {
+  switch (Width) {

cdevadas wrote:

> > currently the alignment is picked from the first MMO and that'd definitely 
> > be smaller than the natural align requirement for the new load
> 
> You don't know that - the alignment in the first MMO will be whatever 
> alignment the compiler could deduce, which could be large, e.g. if the 
> pointer used for the first load was known to have a large alignment.

Are you suggesting to check the alignment in the first MMO and see whether it 
is still the preferred alignment for the merged load, and to use the _ec form 
if that alignment turns out to be smaller than the expected value?

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-06-23 Thread Christudasan Devadasan via llvm-branch-commits


@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const 
CombineInfo ,
   return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM;
 }
   case S_LOAD_IMM:
-switch (Width) {
-default:
-  return 0;
-case 2:
-  return AMDGPU::S_LOAD_DWORDX2_IMM;
-case 3:
-  return AMDGPU::S_LOAD_DWORDX3_IMM;
-case 4:
-  return AMDGPU::S_LOAD_DWORDX4_IMM;
-case 8:
-  return AMDGPU::S_LOAD_DWORDX8_IMM;
+// For targets that support XNACK replay, use the constrained load opcode.
+if (STI && STI->hasXnackReplay()) {
+  switch (Width) {

cdevadas wrote:

I guess the merged load is currently always under-aligned. While combining the 
MMOs, the alignment is picked from the first MMO, and that would definitely be 
smaller than the natural alignment requirement for the new load. That's one 
reason I conservatively want to emit the _ec equivalent when XNACK is enabled.

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-06-22 Thread Christudasan Devadasan via llvm-branch-commits


@@ -867,13 +867,104 @@ def SMRDBufferImm   : ComplexPattern;
 def SMRDBufferImm32 : ComplexPattern;
 def SMRDBufferSgprImm : ComplexPattern;
 
+class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op 
node:$ptr), [{
+  // Returns true if it is a naturally aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size));

cdevadas wrote:

Isn't it 12 >= 16 or 16 >= 16?
PowerOf2Ceil is intended to catch the first case, where the specified align is 
smaller than the natural alignment.
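
A small standalone sketch of the arithmetic in question, using std::bit_ceil in place of PowerOf2Ceil (names are assumptions for illustration):

```cpp
#include <bit>
#include <cstdint>

// Sketch: single-dword loads (<= 4 bytes) always count as aligned; a
// multi-dword load must be aligned to the next power of two of its store
// size, so a 12-byte (dwordx3) load needs 16-byte alignment.
bool isNaturallyAlignedLoad(uint64_t StoreSizeInBytes, uint64_t AlignInBytes) {
  const uint64_t NaturalAlign = std::bit_ceil(StoreSizeInBytes); // PowerOf2Ceil
  return StoreSizeInBytes <= 4 || AlignInBytes >= NaturalAlign;
}
```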

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-06-21 Thread Christudasan Devadasan via llvm-branch-commits


@@ -886,26 +977,17 @@ multiclass SMRD_Pattern  {
   def : GCNPat <
 (smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
 (vt (!cast(Instr#"_SGPR") $sbase, $soffset, 0))> {
-let OtherPredicates = [isNotGFX9Plus];
-  }
-  def : GCNPat <
-(smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
-(vt (!cast(Instr#"_SGPR_IMM") $sbase, $soffset, 0, 0))> {
-let OtherPredicates = [isGFX9Plus];
+let OtherPredicates = [isGFX6GFX7];
   }
 
-  // 4. SGPR+IMM offset
+  // 4. No offset
   def : GCNPat <
-(smrd_load (SMRDSgprImm i64:$sbase, i32:$soffset, i32:$offset)),
-(vt (!cast(Instr#"_SGPR_IMM") $sbase, $soffset, $offset, 0))> {
-let OtherPredicates = [isGFX9Plus];
+(vt (smrd_load (i64 SReg_64:$sbase))),
+(vt (!cast(Instr#"_IMM") i64:$sbase, 0, 0))> {
+let OtherPredicates = [isGFX6GFX7];
   }
 
-  // 5. No offset
-  def : GCNPat <
-(vt (smrd_load (i64 SReg_64:$sbase))),
-(vt (!cast(Instr#"_IMM") i64:$sbase, 0, 0))
-  >;
+  defm : SMRD_Align_Pattern;

cdevadas wrote:

I was using the predicate for gfx8+, which has XNACK replay support. I should 
instead check whether XNACK is actually enabled. Will change it.

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-06-21 Thread Christudasan Devadasan via llvm-branch-commits


@@ -867,13 +867,104 @@ def SMRDBufferImm   : ComplexPattern;
 def SMRDBufferImm32 : ComplexPattern;
 def SMRDBufferSgprImm : ComplexPattern;
 
+class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op 
node:$ptr), [{
+  // Returns true if it is a naturally aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size));

cdevadas wrote:

This is for the DWORDX3 case: even though the size is 12 bytes, its natural 
alignment is 16 bytes.

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-06-20 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

> This looks like it is affecting codegen even when xnack is disabled? That 
> should not happen.

It shouldn't. I put the xnack replay subtarget check before using *_ec 
equivalents. See the code here: 
https://github.com/llvm/llvm-project/commit/65eb44327cf32a83dbbf13eb70f9d8c03f3efaef#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-06-20 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas ready_for_review 
https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-06-20 Thread Christudasan Devadasan via llvm-branch-commits

https://github.com/cdevadas ready_for_review 
https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)

2024-06-20 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

> [!WARNING]
> This pull request is not mergeable via GitHub because a downstack PR is
> open. Once all requirements are satisfied, merge this PR as a stack on
> [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96162).
> [Learn more](https://graphite.dev/docs/merge-pull-requests)

* **#96163**
* **#96162** (this PR)
* **#96161**
* `main`

This stack of pull requests is managed by Graphite.
[Learn more about stacking](https://stacking.dev/).

https://github.com/llvm/llvm-project/pull/96162


[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)

2024-06-20 Thread Christudasan Devadasan via llvm-branch-commits

cdevadas wrote:

> [!WARNING]
> This pull request is not mergeable via GitHub because a downstack PR is
> open. Once all requirements are satisfied, merge this PR as a stack on
> [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96163).
> [Learn more](https://graphite.dev/docs/merge-pull-requests)

* **#96163** (this PR)
* **#96162**
* **#96161**
* `main`

This stack of pull requests is managed by Graphite.
[Learn more about stacking](https://stacking.dev/).

https://github.com/llvm/llvm-project/pull/96163


[llvm-branch-commits] [clang] [llvm] AMDGPU: Remove ds atomic fadd intrinsics (PR #95396)

2024-06-13 Thread Christudasan Devadasan via llvm-branch-commits


@@ -2331,40 +2337,74 @@ static Value *upgradeARMIntrinsicCall(StringRef Name, 
CallBase *CI, Function *F,
   llvm_unreachable("Unknown function for ARM CallBase upgrade.");
 }
 
+// These are expected to have have the arguments:

cdevadas wrote:

```suggestion
// These are expected to have the arguments:
```

https://github.com/llvm/llvm-project/pull/95396


[llvm-branch-commits] [llvm] ff8a1ca - [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.

2021-01-22 Thread Christudasan Devadasan via llvm-branch-commits

Author: Christudasan Devadasan
Date: 2021-01-22T14:20:59+05:30
New Revision: ff8a1cae181438b97937848060da1efb67117ea4

URL: 
https://github.com/llvm/llvm-project/commit/ff8a1cae181438b97937848060da1efb67117ea4
DIFF: 
https://github.com/llvm/llvm-project/commit/ff8a1cae181438b97937848060da1efb67117ea4.diff

LOG: [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.

During instruction selection, there is an inconsistency in choosing
the initial soffset value. With certain early passes, this value is
getting modified and that brought additional fixup during
eliminateFrameIndex to work for all cases. This whole transformation
looks trivial and can be handled better.

This patch clearly defines the initial value for soffset and keeps it
unchanged before eliminateFrameIndex. The initial value must be zero
for MUBUF with a frame index. The non-frame index MUBUF forms that
use a raw offset from SP will have the stack register for soffset.
During frame elimination, the soffset remains zero for entry functions
with zero dynamic allocas and no callsites, or else is updated to the
appropriate frame/stack register.

Also, did some code clean up and made all asserts around soffset
stricter to match.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D95071

Added: 


Modified: 
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-private.mir
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-private.mir
llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
llvm/test/CodeGen/AMDGPU/fold-fi-mubuf.mir
llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll

Removed: 




diff  --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 3c66745c0e70..340f4ac6f57a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -1523,7 +1523,9 @@ std::pair 
AMDGPUDAGToDAGISel::foldFrameIndex(SDValue N) const
   FI ? CurDAG->getTargetFrameIndex(FI->getIndex(), FI->getValueType(0)) : 
N;
 
   // We rebase the base address into an absolute stack address and hence
-  // use constant 0 for soffset.
+  // use constant 0 for soffset. This value must be retained until
+  // frame elimination and eliminateFrameIndex will choose the appropriate
+  // frame register if need be.
   return std::make_pair(TFI, CurDAG->getTargetConstant(0, DL, MVT::i32));
 }
 

diff  --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 7255a061b26b..bd577a6fb8c5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -3669,13 +3669,9 @@ 
AMDGPUInstructionSelector::selectMUBUFScratchOffen(MachineOperand ) const {
MIB.addReg(HighBits);
  },
  [=](MachineInstrBuilder ) { // soffset
-   const MachineMemOperand *MMO = *MI->memoperands_begin();
-   const MachinePointerInfo  = MMO->getPointerInfo();
-
-   if (isStackPtrRelative(PtrInfo))
- MIB.addReg(Info->getStackPtrOffsetReg());
-   else
- MIB.addImm(0);
+   // Use constant zero for soffset and rely on eliminateFrameIndex
+   // to choose the appropriate frame register if need be.
+   MIB.addImm(0);
  },
  [=](MachineInstrBuilder ) { // offset
MIB.addImm(Offset & 4095);
@@ -3722,15 +3718,9 @@ 
AMDGPUInstructionSelector::selectMUBUFScratchOffen(MachineOperand ) const {
MIB.addReg(VAddr);
},
[=](MachineInstrBuilder ) { // soffset
- // If we don't know this private access is a local stack object, 
it
- // needs to be relative to the entry point's scratch wave offset.
- // TODO: Should split large offsets that don't fit like above.
- // TODO: Don't use scratch wave offset just because the offset
- // didn't fit.
- if (!Info->isEntryFunction() && FI.hasValue())
-   MIB.addReg(Info->getStackPtrOffsetReg());
- else
-   MIB.addImm(0);
+ // Use constant zero for soffset and rely on eliminateFrameIndex
+ // to choose the appropriate frame register if need be.
+ MIB.addImm(0);
},
[=](MachineInstrBuilder ) { // offset
  MIB.addImm(Offset);

diff  --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp 
b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index d22bdb791535..d5fa9afded27 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

[llvm-branch-commits] [llvm] c971bcd - [AMDGPU] Test clean up (NFC)

2021-01-22 Thread Christudasan Devadasan via llvm-branch-commits

Author: Christudasan Devadasan
Date: 2021-01-22T13:38:52+05:30
New Revision: c971bcd2102b905e6469463fb8309ab3f7b2b8f2

URL: 
https://github.com/llvm/llvm-project/commit/c971bcd2102b905e6469463fb8309ab3f7b2b8f2
DIFF: 
https://github.com/llvm/llvm-project/commit/c971bcd2102b905e6469463fb8309ab3f7b2b8f2.diff

LOG: [AMDGPU] Test clean up (NFC)

Added: 


Modified: 
llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll

Removed: 




diff  --git a/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll 
b/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
index 4fb2625a5221..5e4b5f70de0b 100644
--- a/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
@@ -16,7 +16,7 @@
 ; so eliminateFrameIndex would not adjust the access to use the
 ; correct FP offset.
 
-define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out, 
i8 addrspace(1)* %in) {
+define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out) {
 ; MUBUF-LABEL: local_stack_offset_uses_sp:
 ; MUBUF:   ; %bb.0: ; %entry
 ; MUBUF-NEXT:s_load_dwordx2 s[4:5], s[4:5], 0x0
@@ -106,7 +106,7 @@ entry:
   ret void
 }
 
-define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 
addrspace(1)* %in) {
+define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out) {
 ; MUBUF-LABEL: func_local_stack_offset_uses_sp:
 ; MUBUF:   ; %bb.0: ; %entry
 ; MUBUF-NEXT:s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)





[llvm-branch-commits] [llvm] ae25a39 - AMDGPU/GlobalISel: Enable sret demotion

2021-01-07 Thread Christudasan Devadasan via llvm-branch-commits

Author: Christudasan Devadasan
Date: 2021-01-08T10:56:35+05:30
New Revision: ae25a397e9de833ffbd5d8e3b480086404625cb7

URL: 
https://github.com/llvm/llvm-project/commit/ae25a397e9de833ffbd5d8e3b480086404625cb7
DIFF: 
https://github.com/llvm/llvm-project/commit/ae25a397e9de833ffbd5d8e3b480086404625cb7.diff

LOG: AMDGPU/GlobalISel: Enable sret demotion

Added: 


Modified: 
llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
llvm/test/CodeGen/AMDGPU/GlobalISel/function-returns.ll
llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-call-return-values.ll

Removed: 




diff  --git a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h 
b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
index dff73d185114..57ff3900ef25 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
@@ -358,8 +358,8 @@ class CallLowering {
   /// described by \p Outs can fit into the return registers. If false
   /// is returned, an sret-demotion is performed.
   virtual bool canLowerReturn(MachineFunction , CallingConv::ID CallConv,
-  SmallVectorImpl , bool 
IsVarArg,
-  LLVMContext ) const {
+  SmallVectorImpl ,
+  bool IsVarArg) const {
 return true;
   }
 

diff  --git a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp 
b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
index e1591a4bf19b..a6d4ea76 100644
--- a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
@@ -95,8 +95,7 @@ bool CallLowering::lowerCall(MachineIRBuilder , 
const CallBase ,
 
   SmallVector SplitArgs;
   getReturnInfo(CallConv, RetTy, CB.getAttributes(), SplitArgs, DL);
-  Info.CanLowerReturn =
-  canLowerReturn(MF, CallConv, SplitArgs, IsVarArg, RetTy->getContext());
+  Info.CanLowerReturn = canLowerReturn(MF, CallConv, SplitArgs, IsVarArg);
 
   if (!Info.CanLowerReturn) {
 // Callee requires sret demotion.
@@ -592,8 +591,7 @@ bool 
CallLowering::checkReturnTypeForCallConv(MachineFunction ) const {
   SmallVector SplitArgs;
   getReturnInfo(CallConv, ReturnType, F.getAttributes(), SplitArgs,
 MF.getDataLayout());
-  return canLowerReturn(MF, CallConv, SplitArgs, F.isVarArg(),
-ReturnType->getContext());
+  return canLowerReturn(MF, CallConv, SplitArgs, F.isVarArg());
 }
 
 bool CallLowering::analyzeArgInfo(CCState ,

diff  --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 3b6e263ee6d8..b86052e3a14e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -20,6 +20,7 @@
 #include "SIMachineFunctionInfo.h"
 #include "SIRegisterInfo.h"
 #include "llvm/CodeGen/Analysis.h"
+#include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
 
@@ -420,6 +421,22 @@ static void unpackRegsToOrigType(MachineIRBuilder ,
   B.buildUnmerge(UnmergeResults, UnmergeSrc);
 }
 
+bool AMDGPUCallLowering::canLowerReturn(MachineFunction ,
+CallingConv::ID CallConv,
+SmallVectorImpl ,
+bool IsVarArg) const {
+  // For shaders. Vector types should be explicitly handled by CC.
+  if (AMDGPU::isEntryFunctionCC(CallConv))
+return true;
+
+  SmallVector ArgLocs;
+  const SITargetLowering  = *getTLI();
+  CCState CCInfo(CallConv, IsVarArg, MF, ArgLocs,
+ MF.getFunction().getContext());
+
+  return checkReturn(CCInfo, Outs, TLI.CCAssignFnForReturn(CallConv, 
IsVarArg));
+}
+
 /// Lower the return value for the already existing \p Ret. This assumes that
 /// \p B's insertion point is correct.
 bool AMDGPUCallLowering::lowerReturnVal(MachineIRBuilder ,
@@ -533,7 +550,9 @@ bool AMDGPUCallLowering::lowerReturn(MachineIRBuilder , 
const Value *Val,
 Ret.addUse(ReturnAddrVReg);
   }
 
-  if (!lowerReturnVal(B, Val, VRegs, Ret))
+  if (!FLI.CanLowerReturn)
+insertSRetStores(B, Val->getType(), VRegs, FLI.DemoteRegister);
+  else if (!lowerReturnVal(B, Val, VRegs, Ret))
 return false;
 
   if (ReturnOpc == AMDGPU::S_SETPC_B64_return) {
@@ -872,6 +891,11 @@ bool AMDGPUCallLowering::lowerFormalArguments(
   unsigned Idx = 0;
   unsigned PSInputNum = 0;
 
+  // Insert the hidden sret parameter if the return value won't fit in the
+  // return registers.
+  if (!FLI.CanLowerReturn)
+insertSRetIncomingArgument(F, SplitArgs, FLI.DemoteRegister, MRI, DL);
+
   for (auto  : F.args()) {
 if (DL.getTypeStoreSize(Arg.getType()) == 0)
   continue;
@@ 

[llvm-branch-commits] [llvm] d68458b - [GlobalISel] Base implementation for sret demotion.

2021-01-05 Thread Christudasan Devadasan via llvm-branch-commits

Author: Christudasan Devadasan
Date: 2021-01-06T10:30:50+05:30
New Revision: d68458bd56d9d55b05fca5447891aa8752d70509

URL: 
https://github.com/llvm/llvm-project/commit/d68458bd56d9d55b05fca5447891aa8752d70509
DIFF: 
https://github.com/llvm/llvm-project/commit/d68458bd56d9d55b05fca5447891aa8752d70509.diff

LOG: [GlobalISel] Base implementation for sret demotion.

If the return values can't be lowered to registers
SelectionDAG performs the sret demotion. This patch
contains the basic implementation for the same in
the GlobalISel pipeline.

Furthermore, targets should bring relevant changes
during lowerFormalArguments, lowerReturn and
lowerCall to make use of this feature.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D92953

Added: 


Modified: 
llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
llvm/lib/Target/AArch64/GISel/AArch64CallLowering.cpp
llvm/lib/Target/AArch64/GISel/AArch64CallLowering.h
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
llvm/lib/Target/ARM/ARMCallLowering.cpp
llvm/lib/Target/ARM/ARMCallLowering.h
llvm/lib/Target/Mips/MipsCallLowering.cpp
llvm/lib/Target/Mips/MipsCallLowering.h
llvm/lib/Target/PowerPC/GISel/PPCCallLowering.cpp
llvm/lib/Target/PowerPC/GISel/PPCCallLowering.h
llvm/lib/Target/RISCV/RISCVCallLowering.cpp
llvm/lib/Target/RISCV/RISCVCallLowering.h
llvm/lib/Target/X86/X86CallLowering.cpp
llvm/lib/Target/X86/X86CallLowering.h
llvm/tools/llvm-exegesis/lib/Assembler.cpp

Removed: 




diff  --git a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h 
b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
index dbd7e00c429a..dff73d185114 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
@@ -19,6 +19,7 @@
 #include "llvm/CodeGen/CallingConvLower.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/TargetCallingConv.h"
+#include "llvm/IR/Attributes.h"
 #include "llvm/IR/CallingConv.h"
 #include "llvm/IR/Type.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -31,6 +32,7 @@ namespace llvm {
 class CallBase;
 class DataLayout;
 class Function;
+class FunctionLoweringInfo;
 class MachineIRBuilder;
 struct MachinePointerInfo;
 class MachineRegisterInfo;
@@ -42,21 +44,30 @@ class CallLowering {
 
   virtual void anchor();
 public:
-  struct ArgInfo {
+  struct BaseArgInfo {
+Type *Ty;
+SmallVector Flags;
+bool IsFixed;
+
+BaseArgInfo(Type *Ty,
+ArrayRef Flags = ArrayRef(),
+bool IsFixed = true)
+: Ty(Ty), Flags(Flags.begin(), Flags.end()), IsFixed(IsFixed) {}
+
+BaseArgInfo() : Ty(nullptr), IsFixed(false) {}
+  };
+
+  struct ArgInfo : public BaseArgInfo {
 SmallVector Regs;
 // If the argument had to be split into multiple parts according to the
 // target calling convention, then this contains the original vregs
 // if the argument was an incoming arg.
 SmallVector OrigRegs;
-Type *Ty;
-SmallVector Flags;
-bool IsFixed;
 
 ArgInfo(ArrayRef Regs, Type *Ty,
 ArrayRef Flags = ArrayRef(),
 bool IsFixed = true)
-: Regs(Regs.begin(), Regs.end()), Ty(Ty),
-  Flags(Flags.begin(), Flags.end()), IsFixed(IsFixed) {
+: BaseArgInfo(Ty, Flags, IsFixed), Regs(Regs.begin(), Regs.end()) {
   if (!Regs.empty() && Flags.empty())
 this->Flags.push_back(ISD::ArgFlagsTy());
   // FIXME: We should have just one way of saying "no register".
@@ -65,7 +76,7 @@ class CallLowering {
  "only void types should have no register");
 }
 
-ArgInfo() : Ty(nullptr), IsFixed(false) {}
+ArgInfo() : BaseArgInfo() {}
   };
 
   struct CallLoweringInfo {
@@ -101,6 +112,15 @@ class CallLowering {
 
 /// True if the call is to a vararg function.
 bool IsVarArg = false;
+
+/// True if the function's return value can be lowered to registers.
+bool CanLowerReturn = true;
+
+/// VReg to hold the hidden sret parameter.
+Register DemoteRegister;
+
+/// The stack index for sret demotion.
+int DemoteStackIndex;
   };
 
   /// Argument handling is mostly uniform between the four places that
@@ -292,20 +312,73 @@ class CallLowering {
 return false;
   }
 
+  /// Load the returned value from the stack into virtual registers in \p 
VRegs.
+  /// It uses the frame index \p FI and the start offset from \p DemoteReg.
+  /// The loaded data size will be determined from \p RetTy.
+  void insertSRetLoads(MachineIRBuilder , Type *RetTy,
+   ArrayRef VRegs, Register DemoteReg,
+   int FI) const;
+
+  /// Store the return value given by \p VRegs into stack starting at the 
offset
+  ///