llvmbot wrote:

<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-llvm-ir

Author: Sameer Sahasrabuddhe (ssahasra)

<details>
<summary>Changes</summary>

Asynchronous operations are memory transfers (usually between the global memory 
and LDS) that are completed independently at an unspecified scope. A thread 
that requests one or more asynchronous transfers can use *async markers* to 
track their completion. The thread waits for each marker to be *completed*, 
which indicates that requests initiated in program order before this marker 
have also completed.

For now, we implement asyncmark/wait operations on pre-GFX12 architectures that 
support "LDS DMA" operations. These "legacy" operations are now extended to 
accept an optional `ASYNC` parameter as a bit in the auxiliary argument. When 
set, it indicates that the compiler should not automatically track the 
completion of this operation.

Future work will extend support to GFX12Plus architectures that support "true" 
async operations.

Co-authored-by: Ryan Mitchell &lt;ryan.mitchell@<!-- -->amd.com&gt;

Fixes: SWDEV-521121

---

Patch is 73.66 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/173259.diff


15 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+8) 
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark-errs.cl (+7) 
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark.cl (+16) 
- (added) llvm/docs/AMDGPUAsyncOperations.rst (+180) 
- (modified) llvm/docs/AMDGPUUsage.rst (+9-3) 
- (modified) llvm/docs/UserGuides.rst (+4) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+9) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+6) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp (+15-1) 
- (modified) llvm/lib/Target/AMDGPU/SIDefines.h (+5-2) 
- (modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+307-16) 
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+4) 
- (modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+12-1) 
- (added) llvm/test/CodeGen/AMDGPU/async-mark-err.ll (+10) 
- (added) llvm/test/CodeGen/AMDGPU/async-mark-pregfx12.ll (+898) 


``````````diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 88b306462a92c..972e1580912dc 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -538,6 +538,14 @@ TARGET_BUILTIN(__builtin_amdgcn_ashr_pk_u8_i32, 
"UsUiUiUi", "nc", "ashr-pk-insts
 TARGET_BUILTIN(__builtin_amdgcn_cvt_scalef32_2xpk16_fp6_f32, "V6UiV16fV16ff", 
"nc", "gfx950-insts")
 TARGET_BUILTIN(__builtin_amdgcn_cvt_scalef32_2xpk16_bf6_f32, "V6UiV16fV16ff", 
"nc", "gfx950-insts")
 
+//===----------------------------------------------------------------------===//
+// Async mark builtins.
+//===----------------------------------------------------------------------===//
+
+// FIXME: Not supported on GFX12 yet. Will need a new feature when we do.
+TARGET_BUILTIN(__builtin_amdgcn_asyncmark, "v", "n", "vmem-to-lds-load-insts")
+TARGET_BUILTIN(__builtin_amdgcn_wait_asyncmark, "vIs", "n", 
"vmem-to-lds-load-insts")
+
 
//===----------------------------------------------------------------------===//
 // GFX12+ only builtins.
 
//===----------------------------------------------------------------------===//
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark-errs.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark-errs.cl
new file mode 100644
index 0000000000000..7d4a141fbde6e
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark-errs.cl
@@ -0,0 +1,7 @@
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -O0 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -target-cpu 
gfx1250 -verify -S -o - %s
+
+void test_feature() {
+  __builtin_amdgcn_asyncmark(); // 
expected-error{{'__builtin_amdgcn_asyncmark' needs target feature 
vmem-to-lds-load-insts}}
+  __builtin_amdgcn_wait_asyncmark(0); // 
expected-error{{'__builtin_amdgcn_wait_asyncmark' needs target feature 
vmem-to-lds-load-insts}}
+}
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark.cl
new file mode 100644
index 0000000000000..976ae3cea5d6d
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-asyncmark.cl
@@ -0,0 +1,16 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx900 -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx942 -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx1010 -emit-llvm -o - %s | FileCheck %s
+// REQUIRES: amdgpu-registered-target
+
+// CHECK-LABEL: @test_invocation(
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:    call void @llvm.amdgcn.asyncmark()
+// CHECK-NEXT:    call void @llvm.amdgcn.wait.asyncmark(i16 0)
+// CHECK-NEXT:    ret void
+//
+void test_invocation() {
+  __builtin_amdgcn_asyncmark();
+  __builtin_amdgcn_wait_asyncmark(0);
+}
diff --git a/llvm/docs/AMDGPUAsyncOperations.rst 
b/llvm/docs/AMDGPUAsyncOperations.rst
new file mode 100644
index 0000000000000..006c59d53294c
--- /dev/null
+++ b/llvm/docs/AMDGPUAsyncOperations.rst
@@ -0,0 +1,180 @@
+===============================
+ AMDGPU Asynchronous Operations
+===============================
+
+.. contents::
+   :local:
+
+Introduction
+============
+
+Asynchronous operations are memory transfers (usually between the global memory
+and LDS) that are completed independently at an unspecified scope. A thread 
that
+requests one or more asynchronous transfers can use *async markers* to track
+their completion. The thread waits for each marker to be *completed*, which
+indicates that requests initiated in program order before this marker have also
+completed.
+
+Operations
+==========
+
+``async_load_to_lds``
+---------------------
+
+.. code-block:: llvm
+
+  ; Legacy "LDS DMA" operations
+  void @llvm.amdgcn.load.to.lds(ptr %src, ptr %dst, ASYNC)
+  void @llvm.amdgcn.global.load.lds(ptr %src, ptr %dst, ASYNC)
+  void @llvm.amdgcn.raw.buffer.load.lds(ptr %src, ptr %dst, ASYNC)
+  void @llvm.amdgcn.raw.ptr.buffer.load.lds(ptr %src, ptr %dst, ASYNC)
+  void @llvm.amdgcn.struct.buffer.load.lds(ptr %src, ptr %dst, ASYNC)
+  void @llvm.amdgcn.struct.ptr.buffer.load.lds(ptr %src, ptr %dst, ASYNC)
+
+Requests an async operation that copies the specified number of bytes from the
+global/buffer pointer ``%src`` to the LDS pointer ``%dst``.
+
+The optional parameter `ASYNC` is a bit in the auxiliary argument to those
+intrinsics, as documented in :ref:`LDS DMA operations<amdgpu-lds-dma-bits>`.
+When set, it indicates that the compiler should not automatically track the
+completion of this operation.
+
+``@llvm.amdgcn.asyncmark()``
+----------------------------
+
+Creates an *async marker* to track all the async operations that are program
+ordered before this call. A marker M is said to be *completed* only when all
+async operations program ordered before M are reported by the implementation as
+having finished, and it is said to be *outstanding* otherwise.
+
+Thus we have the following sufficient condition:
+
+  An async operation X is *completed* at a program point P if there exists a
+  marker M such that X is program ordered before M, M is program ordered before
+  P, and M is completed. X is said to be *outstanding* at P otherwise.
+
+``@llvm.amdgcn.wait.asyncmark(i16 %N)``
+---------------------------------------
+
+Waits until the ``(N+1)``-th predecessor marker M in program order before this
+call is completed, if M exists.
+
+N is an unsigned integer; the ``(N+1)``-th predecessor marker of point X is a
+marker M such that there are ``N`` markers in program order from M to X, not
+including M.
+
+Memory Consistency Model
+========================
+
+Each asynchronous operation consists of a non-atomic read on the source and a
+non-atomic write on the destination. Legacy "LDS DMA" intrinsics result in 
async
+accesses that guarantee visibility relative to other memory operations as
+follows:
+
+  The side-effects of an asynchronous operation `A` program ordered before any
+  memory operation `X` are visible to `X` if `A` is completed before `X`.
+
+  The side-effects of any memory operation `X` program ordered before an
+  asynchronous operation `A` are visible to `A`.
+
+Function calls in LLVM
+======================
+
+The underlying abstract machine does not implicitly track the completion of
+async operations while entering or returning from a function call.
+
+.. note::
+
+   As long as the caller uses sufficient waits to track its own async
+   operations, the actions performed by the callee cannot affect correctness.
+   But the resulting implementation may contain redundant waits, which can be
+   improved by setting the attribute to a value other than `async:unknown`.
+
+Examples
+========
+
+Uneven blocks of async transfers
+--------------------------------
+
+.. code-block:: c++
+
+   void foo(global int *g, local int *l) {
+     // first block
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     asyncmark();
+
+     // second block; longer
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     asyncmark();
+
+     // third block; shorter
+     async_load_to_lds(l, g);
+     async_load_to_lds(l, g);
+     asyncmark();
+
+     // Wait for first block
+     wait.asyncmark(2);
+   }
+
+Software pipeline
+-----------------
+
+.. code-block:: c++
+
+   void foo(global int *g, local int *l) {
+     // first block
+     asyncmark();
+
+     // second block
+     asyncmark();
+
+     // third block
+     asyncmark();
+
+     for (;;) {
+       wait.asyncmark(2);
+       // use data
+
+       // next block
+       asyncmark();
+     }
+
+     // flush one block
+     wait.asyncmark(2);
+
+     // flush one more block
+     wait.asyncmark(1);
+
+     // flush last block
+     wait.asyncmark(0);
+   }
+
+Ordinary function call
+----------------------
+
+.. code-block:: c++
+
+   extern void bar(); // may or may not make async calls
+
+   void foo(global int *g, local int *l) {
+       // first block
+       asyncmark();
+
+       // second block
+       asyncmark();
+
+       // function call
+       bar();
+
+       // third block
+       asyncmark();
+
+       wait.asyncmark(1); // will wait for at least the second block, possibly 
including bar()
+       wait.asyncmark(0); // will wait for third block, including bar()
+   }
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 7ecf1c1124894..691f4c8017a2f 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -6594,12 +6594,18 @@ operations.
 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
 termed vector memory operations.
 
+.. _amdgpu-lds-dma-bits:
+
 ``global_load_lds`` or ``buffer/global_load`` instructions with the `lds` flag
 are LDS DMA loads. They interact with caches as if the loaded data were
 being loaded to registers and not to LDS, and so therefore support the same
-cache modifiers. They cannot be performed atomically. They implement volatile
-(via aux/cpol bit 31) and nontemporal (via metadata) as if they were loads
-from the global address space.
+cache modifiers. They cannot be performed atomically. They can be performed 
with
+asynchronous, volatile and nontemporal semantics as if they were loads
+from the global address space:
+
+- asynchronous: aux bit u0x40000000
+- volatile: aux bit u0x20
+- nontemporal: metadata
 
 Private address space uses ``buffer_load/store`` using the scratch V#
 (GFX6-GFX8), or ``scratch_load/store`` (GFX9-GFX11). Since only a single thread
diff --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst
index d3ca2f69016c1..6b34cc5632d40 100644
--- a/llvm/docs/UserGuides.rst
+++ b/llvm/docs/UserGuides.rst
@@ -18,6 +18,7 @@ intermediate LLVM representation.
    AdvancedBuilds
    AliasAnalysis
    AMDGPUUsage
+   AMDGPUAsyncOperations
    Benchmarking
    BigEndianNEON
    BuildingADistribution
@@ -283,6 +284,9 @@ Additional Topics
 :doc:`AMDGPUUsage`
    This document describes using the AMDGPU backend to compile GPU kernels.
 
+:doc:`AMDGPUAsyncOperations`
+   Builtins for invoking asynchronous data transfer operations in AMD GPUs.
+
 :doc:`AMDGPUDwarfExtensionsForHeterogeneousDebugging`
    This document describes DWARF extensions to support heterogeneous debugging
    for targets such as the AMDGPU backend.
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 19d5f24c5d5e0..ea47fe83ea9ca 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -2855,6 +2855,15 @@ def int_amdgcn_global_load_lds : AMDGPUGlobalLoadLDS;
 def int_amdgcn_pops_exiting_wave_id :
   DefaultAttrsIntrinsic<[llvm_i32_ty], [], [IntrNoMem, IntrHasSideEffects]>;
 
+// Sets a marker in the stream of async requests. Modelled as IntrNoMem with
+// IntrHasSideEffects.
+def int_amdgcn_asyncmark : ClangBuiltin<"__builtin_amdgcn_asyncmark">,
+  Intrinsic<[], [], [IntrNoMem, IntrHasSideEffects]>;
+
+// Waits until the Nth previous marker is completed, if it exists.
+def int_amdgcn_wait_asyncmark :
+    ClangBuiltin<"__builtin_amdgcn_wait_asyncmark">,
+    Intrinsic<[], [llvm_i16_ty], [ImmArg<ArgIndex<0>>, IntrNoMem, 
IntrHasSideEffects]>;
+
 
//===----------------------------------------------------------------------===//
 // GFX10 Intrinsics
 
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 5dc7c8327102e..334ba33c20264 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -2375,6 +2375,12 @@ bool 
AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
   case Intrinsic::amdgcn_load_to_lds:
   case Intrinsic::amdgcn_global_load_lds:
     return selectGlobalLoadLds(I);
+  case Intrinsic::amdgcn_asyncmark:
+  case Intrinsic::amdgcn_wait_asyncmark:
+    // FIXME: Not supported on GFX12 yet. Will need a new feature when we do.
+    if (!Subtarget->hasVMemToLDSLoad())
+      return false;
+    break;
   case Intrinsic::amdgcn_exp_compr:
     if (!STI.hasCompressedExport()) {
       Function &F = I.getMF()->getFunction();
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
index bf9b4297bd435..75da5acbc57c7 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
@@ -347,7 +347,7 @@ void AMDGPUAsmPrinter::emitInstruction(const MachineInstr 
*MI) {
     }
   } else {
     // We don't want these pseudo instructions encoded. They are
-    // placeholder terminator instructions and should only be printed as
+    // placeholder instructions and should only be printed as
     // comments.
     if (MI->getOpcode() == AMDGPU::SI_RETURN_TO_EPILOG) {
       if (isVerbose())
@@ -361,6 +361,20 @@ void AMDGPUAsmPrinter::emitInstruction(const MachineInstr 
*MI) {
       return;
     }
 
+    if (MI->getOpcode() == AMDGPU::S_ASYNCMARK) {
+      if (isVerbose())
+        OutStreamer->emitRawComment(" s_asyncmark");
+      return;
+    }
+
+    if (MI->getOpcode() == AMDGPU::S_WAIT_ASYNCMARK) {
+      if (isVerbose()) {
+        OutStreamer->emitRawComment(" s_wait_asyncmark(" +
+                                    Twine(MI->getOperand(0).getImm()) + ")");
+      }
+      return;
+    }
+
     if (MI->getOpcode() == AMDGPU::SCHED_BARRIER) {
       if (isVerbose()) {
         std::string HexString;
diff --git a/llvm/lib/Target/AMDGPU/SIDefines.h 
b/llvm/lib/Target/AMDGPU/SIDefines.h
index 0d206aba33543..6504b0fdae190 100644
--- a/llvm/lib/Target/AMDGPU/SIDefines.h
+++ b/llvm/lib/Target/AMDGPU/SIDefines.h
@@ -368,12 +368,15 @@ enum CPol {
   GLC = 1,
   SLC = 2,
   DLC = 4,
+  SWZ_pregfx12 = 8,
   SCC = 16,
+  ASYNC_pregfx12 = 32,
+
   SC0 = GLC,
   SC1 = SCC,
   NT = SLC,
-  ALL_pregfx12 = GLC | SLC | DLC | SCC,
-  SWZ_pregfx12 = 8,
+  // Bits that should survive in MIR
+  ALL_pregfx12 = GLC | SLC | DLC | SCC | ASYNC_pregfx12,
 
   // Below are GFX12+ cache policy bits
 
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp 
b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index e21583ae0876f..2333b2fc4d460 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -457,6 +457,9 @@ class SIInsertWaitcnts {
   // message.
   DenseSet<MachineInstr *> ReleaseVGPRInsts;
 
+  // Track legacy async instructions to later remove CPol::ASYNC_pregfx12
+  SmallVector<MachineInstr *, 32> InstsWithAsyncCpolBit;
+
   HardwareLimits Limits;
 
 public:
@@ -566,6 +569,35 @@ class SIInsertWaitcnts {
     return VmemReadMapping[getVmemType(Inst)];
   }
 
+  bool hasCPolAsyncBit(const MachineInstr &MI) const {
+    const MachineOperand *CPol = TII->getNamedOperand(MI, 
AMDGPU::OpName::cpol);
+    if (!CPol || !CPol->isImm())
+      return false;
+    return CPol->getImm() & AMDGPU::CPol::ASYNC_pregfx12;
+  }
+
+  // FIXME: For GFX1250, this should also check for usesASYNC_CNT
+  bool isAsync(const MachineInstr &MI) const {
+    if (!SIInstrInfo::isLDSDMA(MI))
+      return false;
+    if (SIInstrInfo::usesASYNC_CNT(MI)) {
+      return true;
+    }
+    return hasCPolAsyncBit(MI);
+  }
+
+  bool isNonAsyncLdsDmaWrite(const MachineInstr &MI) const {
+    if (!SIInstrInfo::mayWriteLDSThroughDMA(MI))
+      return false;
+    return !isAsync(MI);
+  }
+
+  bool isAsyncLdsDmaWrite(const MachineInstr &MI) const {
+    if (!SIInstrInfo::mayWriteLDSThroughDMA(MI))
+      return false;
+    return isAsync(MI);
+  }
+
   bool isVmemAccess(const MachineInstr &MI) const;
   bool generateWaitcntInstBefore(MachineInstr &MI,
                                  WaitcntBrackets &ScoreBrackets,
@@ -653,6 +685,11 @@ class WaitcntBrackets {
     return It != VMem.end() ? It->second.Scores[T] : 0;
   }
 
+  unsigned getClampedWait(InstCounterType T, unsigned ScoreToWait) const {
+    return std::min(getScoreUB(T) - ScoreToWait,
+                    Context->getWaitCountMax(T) - 1);
+  }
+
   bool merge(const WaitcntBrackets &Other);
 
   bool counterOutOfOrder(InstCounterType T) const;
@@ -666,11 +703,13 @@ class WaitcntBrackets {
                                AMDGPU::Waitcnt &Wait) const;
   void determineWaitForLDSDMA(InstCounterType T, VMEMID TID,
                               AMDGPU::Waitcnt &Wait) const;
+  AMDGPU::Waitcnt determineAsyncWait(unsigned N);
   void tryClearSCCWriteEvent(MachineInstr *Inst);
 
   void applyWaitcnt(const AMDGPU::Waitcnt &Wait);
   void applyWaitcnt(InstCounterType T, unsigned Count);
   void updateByEvent(WaitEventType E, MachineInstr &MI);
+  void recordAsyncMark(MachineInstr &MI);
 
   unsigned hasPendingEvent() const { return PendingEvents; }
   unsigned hasPendingEvent(WaitEventType E) const {
@@ -704,10 +743,7 @@ class WaitcntBrackets {
     return LastGDS > ScoreLBs[DS_CNT] && LastGDS <= ScoreUBs[DS_CNT];
   }
 
-  unsigned getPendingGDSWait() const {
-    return std::min(getScoreUB(DS_CNT) - LastGDS,
-                    Context->getWaitCountMax(DS_CNT) - 1);
-  }
+  unsigned getPendingGDSWait() const { return getClampedWait(DS_CNT, LastGDS); 
}
 
   void setPendingGDS() { LastGDS = ScoreUBs[DS_CNT]; }
 
@@ -766,6 +802,9 @@ class WaitcntBrackets {
 
   static bool mergeScore(const MergeInfo &M, unsigned &Score,
                          unsigned OtherScore);
+  bool mergeAsyncMarkers(
+      const MergeInfo MergeInfos[NUM_INST_CNTS],
+      const SmallVectorImpl<std::array<unsigned, NUM_INST_CNTS>> 
&OtherMarkers);
 
   iterator_range<MCRegUnitIterator> regunits(MCPhysReg Reg) const {
     assert(Reg != AMDGPU::SCC && "Shouldn't be used on SCC");
@@ -817,6 +856,8 @@ class WaitcntBrackets {
   void setScoreByOperand(const MachineOperand &Op, InstCounterType CntTy,
                          unsigned Val);
 
+  InstCounterType getAsyncCounterType() const { return LOAD_CNT; }
+
   const SIInsertWaitcnts *Context;
 
   unsigned ScoreLBs[NUM_INST_CNTS] = {0};
@@ -872,6 +913,11 @@ class WaitcntBrackets {
   // Store representative LDS DMA operations. The only useful info here is
   // alias info. One store is kept per unique AAInfo.
   SmallVector<const MachineInstr *> LDSDMAStores;
+
+  SmallVector<std::array<unsigned, NUM_INST_CNTS>> AsyncMarkers;
+  // Track the upper bound score for async operations that are not part of a
+  // marker yet. Initialized to all zeros.
+  std::array<unsigned, NUM_INST_CNTS> AsyncScore{};
 };
 
 class SIInsertWaitcntsLegacy : public MachineFunctionPass {
@@ -1063,7 +1109,7 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, 
MachineInstr &Inst) {
       setScoreByOperand(Op, T, CurrScore);
     }
     if (Inst.mayStore() &&
-        (TII->isDS(Inst) || TII->mayWriteLDSThroughDMA(Inst))) {
+        (TII->isDS(Inst) || (Context->isNonAsyncLdsDmaWrite(Inst)))) {
       // MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS
       // written can be accessed. A load from LDS to VMEM does not need a wait.
       //
@@ -1107,6 +1153,12 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, 
MachineInstr &Inst) {
         setVMemScore(LDSDMA_BEGIN + Slot, T, CurrScore);
     }
 
+    if (Context->isAsyncLdsDmaWrite(Inst) && T == LOAD_CNT) {
+      // FIXME: Not supported on GFX12 yet. Will need a new feature when we do.
+      assert(!SIInstrInfo::usesASYNC_CNT(Inst));
+      AsyncScore[T] = CurrScore;
+    }
+
     if (SIInstrInfo::isSBarrierSCCWrite(Inst.getOpcode())) {
       setRegScore(AMDGPU::SCC, T, CurrScore);
       PendingSCCWrite = &Inst;
@@ -1114,6 +1166,18 @@ void WaitcntBrackets::updateByEvent(WaitEventType E, 
MachineInstr &Inst) {
   }
 }
 
+void WaitcntBrackets::recordAsyncMark(MachineInstr &Inst) {
+  AsyncMarkers.emplace_back(AsyncScore);
+  AsyncScore = {};
+  LLVM_DEBUG({
+    dbgs() << "recordAsyncMark:\n" << Inst;
+    for (const auto &Marker : AsyncMarkers) {
+      llvm::interleaveComma(Marker, dbgs());
+      dbgs() << '\n';
+    }
+  });
+}
+
 void WaitcntBrackets::print(raw_ostream &OS) const {
   const GCNSubtarget *ST = Context->ST;
 
@@ -1207,6 +1271,58 @@ void WaitcntBrackets::print(raw_ostream &OS) const {
   }
   OS << '\n';
 
+  OS << "Async score: ";
+  if (!AsyncScore.size()) {
+    OS << "none...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/173259
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to