[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-09 Thread Jeffrey Byrnes via cfe-commits

jrbyrnes wrote:

> We should spend more energy making the scheduler sensible by default, instead
> of creating all of this complexity.

I would also prefer a more sensible default scheduler, but the driving use case
for this is global scheduling. The scheduler is doing inefficient things because
it is unaware of loop-carried dependencies. A generalized solution is not
feasible given the timeline for that feature. We could try adding some sort of
ad-hoc heuristic to the scheduler for cases like this, but I don't see how that
would improve complexity relative to this, and it would likely not produce
the results the users expect.
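For context, this is the shape of pipeline users build today with the existing
three-argument builtin (group-mask values per the AMDGPUUsage documentation:
0x8 = MFMA, 0x20 = VMEM read; the loop itself is a hypothetical sketch, not the
user's actual kernel):

    // Loop with a loop-carried dependency: per iteration, group 1 VMEM read
    // with 4 MFMAs (all in sync group 0) so the next iteration's loads are
    // pipelined under the current iteration's MFMAs.
    for (int i = 0; i < n; ++i) {
      // ... global loads and MFMA builtins ...
      __builtin_amdgcn_sched_group_barrier(0x20, 1, 0); // 1 VMEM read
      __builtin_amdgcn_sched_group_barrier(0x8, 4, 0);  // 4 MFMA
    }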

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-08 Thread Matt Arsenault via cfe-commits


@@ -2658,21 +2676,102 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
   return InvertedMask;
 }
 
+void IGroupLPDAGMutation::addSchedGroupBarrierRules() {
+
+  /// Whether or not the instruction has no true data predecessors
+  /// with opcode \p Opc.
+  class NoOpcDataPred : public InstructionRule {
+  protected:
+    unsigned Opc;
+
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SchedGroup> &SyncPipe) override {
+      return !std::any_of(
+          SU->Preds.begin(), SU->Preds.end(), [this](const SDep &Pred) {
+            return Pred.getKind() == SDep::Data &&
+                   Pred.getSUnit()->getInstr()->getOpcode() == Opc;
+          });
+    }
+
+    NoOpcDataPred(unsigned Opc, const SIInstrInfo *TII, unsigned SGID,
+                  bool NeedsCache = false)
+        : InstructionRule(TII, SGID, NeedsCache), Opc(Opc) {}
+  };
+
+  /// Whether or not the instruction has no write after read predecessors
+  /// with opcode \p Opc.
+  class NoOpcWARPred final : public InstructionRule {
+  protected:
+    unsigned Opc;
+
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SchedGroup> &SyncPipe) override {
+      return !std::any_of(
+          SU->Preds.begin(), SU->Preds.end(), [this](const SDep &Pred) {
+            return Pred.getKind() == SDep::Anti &&
+                   Pred.getSUnit()->getInstr()->getOpcode() == Opc;
+          });
+    }
+    NoOpcWARPred(unsigned Opc, const SIInstrInfo *TII, unsigned SGID,
+                 bool NeedsCache = false)
+        : InstructionRule(TII, SGID, NeedsCache), Opc(Opc){};
+  };
+
+  SchedGroupBarrierRuleCallBacks = {
+      [](unsigned SGID, const SIInstrInfo *TII) {
+        return std::make_shared<NoOpcWARPred>(AMDGPU::V_CNDMASK_B32_e64, TII,

arsenm wrote:

There's basically no reason to ever use shared_ptr; something is wrong if it's
necessary over unique_ptr.
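A minimal sketch of that shape against the names in this patch (the RuleFactory
alias is assumed, not from the patch): each rule is owned by exactly one
SchedGroup, so the factories can hand out unique_ptr and addRule can take
ownership:

    // Hypothetical factory type; requires <functional> and <memory>.
    using RuleFactory = std::function<std::unique_ptr<InstructionRule>(
        unsigned SGID, const SIInstrInfo *TII)>;

    SmallVector<RuleFactory, 4> SchedGroupBarrierRuleCallBacks = {
        [](unsigned SGID, const SIInstrInfo *TII) {
          // The derived unique_ptr converts to unique_ptr<InstructionRule>.
          return std::make_unique<NoOpcWARPred>(AMDGPU::V_CNDMASK_B32_e64,
                                                TII, SGID,
                                                /*NeedsCache=*/false);
        }};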

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-08 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm edited https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-08 Thread Matt Arsenault via cfe-commits


@@ -1284,7 +1284,29 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 
   |  ``// 5 MFMA``
   |  ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``
 
-  llvm.amdgcn.iglp_opt                     An **experimental** intrinsic for instruction group level parallelism. The intrinsic
+  llvm.amdgcn.sched.group.barrier.rule     It has the same behavior as sched.group.barrier, except the intrinsic includes
+                                           a fourth argument:
+
+                                           - RuleMask : The bitmask of rules which are applied to the SchedGroup.
+
+                                           The RuleMask is handled as a 64-bit integer, so 64 rules are encodable with a
+                                           single mask.
+
+                                           Users can access the intrinsic by specifying the optional fourth argument in
+                                           the sched_group_barrier builtin:
+
+                                           |  ``// 1 VMEM read invoking rules 1 and 2``
+                                           |  ``__builtin_amdgcn_sched_group_barrier(32, 1, 0, 3)``
+
+                                           Currently available rules are:
+
+                                           - 0x0000: No rule.
+                                           - 0x0001: Instructions in the SchedGroup must not write to the same register
+                                             that a previously occurring V_CNDMASK_B32_e64 reads from.
+                                           - 0x0002: Instructions in the SchedGroup must not write to the same register
+                                             that a previously occurring V_PERM_B32_e64 reads from.
+                                           - 0x0004: Instructions in the SchedGroup must require data produced by a
+                                             V_CNDMASK_B32_e64.
+                                           - 0x0008: Instructions in the SchedGroup must require data produced by a
+                                             V_PERM_B32_e64.
+
arsenm wrote:

These scheduling rules seem way too specific, especially in that they point at
specific instruction encodings by their internal pseudo-instruction names.

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-08 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm commented:

I don't understand how anyone is supposed to use this. This is exposing
extremely specific, random low-level details of the scheduling. Users claim
they want scheduling controls, but what they actually want is for the scheduler
to just do the right thing. We should spend more energy making the scheduler
sensible by default, instead of creating all of this complexity.

If we're going to have something like this, it needs to have predefined macros 
instead of expecting reading 
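A sketch of what such predefined macros could look like for the rules
documented above (all names here are hypothetical; no such macros exist yet):

    #define __AMDGCN_SCHED_RULE_NONE         0x0000 /* hypothetical names */
    #define __AMDGCN_SCHED_RULE_WAR_CNDMASK  0x0001 /* no WAR on V_CNDMASK_B32 sources */
    #define __AMDGCN_SCHED_RULE_WAR_PERM     0x0002 /* no WAR on V_PERM_B32 sources */
    #define __AMDGCN_SCHED_RULE_USE_CNDMASK  0x0004 /* must consume V_CNDMASK_B32 results */
    #define __AMDGCN_SCHED_RULE_USE_PERM     0x0008 /* must consume V_PERM_B32 results */

    __builtin_amdgcn_sched_group_barrier(32, 1, 0,
        __AMDGCN_SCHED_RULE_WAR_CNDMASK | __AMDGCN_SCHED_RULE_WAR_PERM);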

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-01 Thread Jeffrey Byrnes via cfe-commits


@@ -437,16 +437,18 @@ void test_sched_group_barrier()
 }
 
 // CHECK-LABEL: @test_sched_group_barrier_rule
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 -9223372036854775808)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 255)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 1)
 void test_sched_group_barrier_rule()
 {
   __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
   __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
-  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
-  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 63);
+  __builtin_amdgcn_sched_group_barrier(2, 4, 6, 0, 1, 2, 3, 4, 5, 6, 7);

jrbyrnes wrote:

Do you prefer the latest iteration, wherein users provide a mask instead of
the variadic arguments?
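For what it's worth, decoding the updated tests under the variadic scheme:
every argument after the third is a rule ID i that codegen folds into bit i of
a single i64 immediate, which is how the constants in the CHECK lines arise (a
hedged reading of the deltas above):

    __builtin_amdgcn_sched_group_barrier(1, 2, 4, 63);
    // -> rule ID 63 -> i64 mask 1<<63, printed as -9223372036854775808
    __builtin_amdgcn_sched_group_barrier(2, 4, 6, 0, 1, 2, 3, 4, 5, 6, 7);
    // -> rule IDs 0-7 -> i64 mask 0xFF = 255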

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-05-01 Thread Jeffrey Byrnes via cfe-commits

https://github.com/jrbyrnes updated 
https://github.com/llvm/llvm-project/pull/85304

>From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001
From: Jeffrey Byrnes 
Date: Tue, 12 Mar 2024 10:22:24 -0700
Subject: [PATCH 1/4] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to
 support rules.

Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   2 +-
 clang/lib/CodeGen/CGBuiltin.cpp   |  17 +++
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl   |  14 +++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  15 ++-
 llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 --
 llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp  |  14 +++
 llvm/lib/Target/AMDGPU/SIInstructions.td  |  16 +++
 llvm/lib/Target/AMDGPU/SIPostRABundler.cpp|   3 +-
 .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll |  25 
 9 files changed, 202 insertions(+), 16 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 61ec8b79bf054d..f7b6a4610bd80a 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
 BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
-BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n")
+BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n")
 BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 528a13fb275124..4bf71c7535db63 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_grid_size_z:
 return EmitAMDGPUGridSize(*this, 2);
 
+  // scheduling builtins
+  case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
+    return E->getNumArgs() == 3
+               ? Builder.CreateCall(
+                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2))})
+               : Builder.CreateCall(
+                     CGM.getIntrinsic(
+                         Intrinsic::amdgcn_sched_group_barrier_rule),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2)),
+                      EmitScalarExpr(E->getArg(3))});
+  }
+
   // r600 intrinsics
   case AMDGPU::BI__builtin_r600_recipsqrt_ieee:
   case AMDGPU::BI__builtin_r600_recipsqrt_ieeef:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index 8a4533633706b2..e28e0a6987484b 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -436,6 +436,20 @@ void test_sched_group_barrier()
   __builtin_amdgcn_sched_group_barrier(15, 1, -1);
 }
 
+// CHECK-LABEL: @test_sched_group_barrier_rule
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+void test_sched_group_barrier_rule()
+{
+  __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
+  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
+  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+}
+
+
 // CHECK-LABEL: @test_iglp_opt
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0)
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 051e603c0819d2..68fe42a8f04d21 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">,
 // matching instructions that will be associated with this sched_group_barrier.
 // The third parameter is an identifier which is used to describe what other
 // sched_group_barriers should be synchronized with.
-def int_amdgcn_sched_group_barrier : ClangBuiltin<"__builtin_amdgcn_sched_group_barrier">,
-  Intrinsic<[], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
-  [ImmArg<ArgIndex<0>>, ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>, IntrNoMem, IntrHasSideEffects,
-   IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+multiclass 

[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-24 Thread Matt Arsenault via cfe-commits


@@ -18763,19 +18763,28 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
 
   // scheduling builtins
   case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
-    return E->getNumArgs() == 3
-               ? Builder.CreateCall(
-                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2))})
-               : Builder.CreateCall(
-                     CGM.getIntrinsic(
-                         Intrinsic::amdgcn_sched_group_barrier_rule),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2)),
-                      EmitScalarExpr(E->getArg(3))});
+    if (E->getNumArgs() == 3)
+      return Builder.CreateCall(
+          CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+          {EmitScalarExpr(E->getArg(0)), EmitScalarExpr(E->getArg(1)),
+           EmitScalarExpr(E->getArg(2))});
+
+    uint64_t Mask = 0;
+    for (unsigned I = 3; I < E->getNumArgs(); I++) {
+      auto NextArg = EmitScalarExpr(E->getArg(I));
+      auto ArgLiteral = cast<ConstantInt>(NextArg)->getZExtValue();
+      if (ArgLiteral > 63) {
+        CGM.Error(E->getExprLoc(),
+                  getContext().BuiltinInfo.getName(BuiltinID).str() +
+                      " RuleID must be within [0,63].");

arsenm wrote:

Should such checks go in Sema instead? 
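If so, a hedged sketch of the Sema-side form (assuming the check lands in
Sema::CheckAMDGCNBuiltinFunctionCall and that SemaBuiltinConstantArgRange is
usable here as for other builtins; returns true on error):

    case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
      // Every rule-ID argument after the three fixed ones must be a
      // constant in [0, 63], since it indexes a 64-bit rule mask.
      bool Err = false;
      for (unsigned I = 3, E = TheCall->getNumArgs(); I != E; ++I)
        Err |= SemaBuiltinConstantArgRange(TheCall, I, 0, 63);
      return Err;
    }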

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-24 Thread Matt Arsenault via cfe-commits


@@ -18763,19 +18763,28 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
 
   // scheduling builtins
   case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
-    return E->getNumArgs() == 3
-               ? Builder.CreateCall(
-                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2))})
-               : Builder.CreateCall(
-                     CGM.getIntrinsic(
-                         Intrinsic::amdgcn_sched_group_barrier_rule),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2)),
-                      EmitScalarExpr(E->getArg(3))});
+    if (E->getNumArgs() == 3)

arsenm wrote:

Braces 

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-24 Thread Matt Arsenault via cfe-commits


@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 -O0 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -target-cpu gfx90a \
+// RUN:   -verify -S -o - %s
+

arsenm wrote:

-verify tests belong in Sema; there's no codegen here.

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-24 Thread Matt Arsenault via cfe-commits


@@ -2747,23 +2749,32 @@ void IGroupLPDAGMutation::initSchedGroupBarrierPipelineStage(
   int32_t SGMask = SGB.getOperand(0).getImm();
   int32_t Size = SGB.getOperand(1).getImm();
   int32_t SyncID = SGB.getOperand(2).getImm();
-  std::optional<int64_t> RuleID =
+  std::optional<int64_t> RuleMask =
       (SGB.getOpcode() == AMDGPU::SCHED_GROUP_BARRIER_RULE)
           ? SGB.getOperand(3).getImm()
          : std::optional<int64_t>(std::nullopt);
 
-  // Sanitize the input
-  if (RuleID && (!SchedGroupBarrierRuleCallBacks.size() ||
-                 *RuleID > (int)(SchedGroupBarrierRuleCallBacks.size() - 1))) {
-    RuleID = std::nullopt;
-    llvm_unreachable("Bad rule ID!");
-  }
-
   auto SG = &SyncedSchedGroups[SyncID].emplace_back((SchedGroupMask)SGMask,
                                                     Size, SyncID, DAG, TII);
-  if (RuleID)
-    SG->addRule(SchedGroupBarrierRuleCallBacks[*RuleID](SG->getSGID()));
+  // Process the input mask
+  if (RuleMask) {
+    uint64_t TheMask = *RuleMask;
+    unsigned NextID = 0;
+    while (TheMask) {
+      if (!(TheMask & 0x1)) {
+        TheMask >>= 1;
+        ++NextID;
+        continue;
+      }
+      if ((!SchedGroupBarrierRuleCallBacks.size() ||
+           NextID > SchedGroupBarrierRuleCallBacks.size() - 1))

arsenm wrote:

the !size() check is redundant? 
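A sketch of the simplified guard (NextID is unsigned, so one comparison also
covers the empty-callbacks case; control flow assumed from the hunk above):

    // size() == 0 already makes NextID >= size() true for every NextID,
    // so the separate empty() test adds nothing.
    if (NextID >= SchedGroupBarrierRuleCallBacks.size())
      break; // no callback registered for this rule bit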

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-24 Thread Matt Arsenault via cfe-commits


@@ -437,16 +437,18 @@ void test_sched_group_barrier()
 }
 
 // CHECK-LABEL: @test_sched_group_barrier_rule
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 -9223372036854775808)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 255)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 1)
 void test_sched_group_barrier_rule()
 {
   __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
   __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
-  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
-  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 63);
+  __builtin_amdgcn_sched_group_barrier(2, 4, 6, 0, 1, 2, 3, 4, 5, 6, 7);

arsenm wrote:

I have no idea what all these numbers are supposed to mean 
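For reference, an annotated reading of the last call above (group-mask meaning
per the sched_group_barrier documentation; the rule-ID reading is hedged
against the then-current iteration of the patch):

    __builtin_amdgcn_sched_group_barrier(
        2,                        // group mask: VALU instructions (0x2)
        4,                        // size: 4 instructions in this group
        6,                        // SyncID shared with cooperating groups
        0, 1, 2, 3, 4, 5, 6, 7);  // rule IDs, folded into one i64 bitmask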

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-23 Thread Jeffrey Byrnes via cfe-commits

jrbyrnes wrote:

Updated the PR as discussed offline.

Support for the variadic builtin args is implemented by combining them into a
mask for the intrinsic. This implies a limit of 64 rules, but we can work
around that by adding a new intrinsic with two masks (to support rules
65-128), and so on.

For now, the rules in this PR behave as they do in the existing code (that is,
they are additional inclusion criteria). Any changes to this will be addressed
in a future PR.
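A self-contained sketch of the fold, consistent with the CGBuiltin.cpp hunk
reviewed above (the helper name is hypothetical):

    #include <cstdint>
    #include <initializer_list>

    // Fold rule IDs (each in [0, 63]) into the single i64 rule mask the new
    // intrinsic carries; a second mask operand could later cover rules
    // 65-128 the same way.
    uint64_t buildRuleMask(std::initializer_list<unsigned> RuleIDs) {
      uint64_t Mask = 0;
      for (unsigned ID : RuleIDs)
        Mask |= uint64_t(1) << ID;
      return Mask;
    }
    // e.g. buildRuleMask({0, 1, 2, 3, 4, 5, 6, 7}) == 255, matching the
    // i64 255 seen in the updated CHECK lines.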

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-23 Thread Jeffrey Byrnes via cfe-commits

https://github.com/jrbyrnes ready_for_review 
https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-04-23 Thread Jeffrey Byrnes via cfe-commits

https://github.com/jrbyrnes updated 
https://github.com/llvm/llvm-project/pull/85304

>From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001
From: Jeffrey Byrnes 
Date: Tue, 12 Mar 2024 10:22:24 -0700
Subject: [PATCH 1/2] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to
 support rules.

Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   2 +-
 clang/lib/CodeGen/CGBuiltin.cpp   |  17 +++
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl   |  14 +++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  15 ++-
 llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 --
 llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp  |  14 +++
 llvm/lib/Target/AMDGPU/SIInstructions.td  |  16 +++
 llvm/lib/Target/AMDGPU/SIPostRABundler.cpp|   3 +-
 .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll |  25 
 9 files changed, 202 insertions(+), 16 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 61ec8b79bf054d..f7b6a4610bd80a 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
 BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
-BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n")
+BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n")
 BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 528a13fb275124..4bf71c7535db63 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_grid_size_z:
 return EmitAMDGPUGridSize(*this, 2);
 
+  // scheduling builtins
+  case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
+    return E->getNumArgs() == 3
+               ? Builder.CreateCall(
+                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2))})
+               : Builder.CreateCall(
+                     CGM.getIntrinsic(
+                         Intrinsic::amdgcn_sched_group_barrier_rule),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2)),
+                      EmitScalarExpr(E->getArg(3))});
+  }
+
   // r600 intrinsics
   case AMDGPU::BI__builtin_r600_recipsqrt_ieee:
   case AMDGPU::BI__builtin_r600_recipsqrt_ieeef:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index 8a4533633706b2..e28e0a6987484b 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -436,6 +436,20 @@ void test_sched_group_barrier()
   __builtin_amdgcn_sched_group_barrier(15, 1, -1);
 }
 
+// CHECK-LABEL: @test_sched_group_barrier_rule
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+void test_sched_group_barrier_rule()
+{
+  __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
+  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
+  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+}
+
+
 // CHECK-LABEL: @test_iglp_opt
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0)
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 051e603c0819d2..68fe42a8f04d21 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">,
 // matching instructions that will be associated with this sched_group_barrier.
 // The third parameter is an identifier which is used to describe what other
 // sched_group_barriers should be synchronized with.
-def int_amdgcn_sched_group_barrier : ClangBuiltin<"__builtin_amdgcn_sched_group_barrier">,
-  Intrinsic<[], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
-  [ImmArg<ArgIndex<0>>, ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>, IntrNoMem, IntrHasSideEffects,
-   IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
+multiclass 

[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-03-15 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)


Changes

I am still working with the user to define the actual rules, so it is still a
WIP. However, the current version contains the main machinery of the feature.

This helps bridge the gap between sched_group_barrier and iglp_opt, giving
users (with compiler support) more ability to create the pipelines they want.

In particular, this is aimed at helping control scheduling in blocks with 
loop-carried dependencies. Since this is a global scheduling problem, there is 
no straightforward way to tune the scheduler against these blocks.

---
Full diff: https://github.com/llvm/llvm-project/pull/85304.diff


9 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1-1) 
- (modified) clang/lib/CodeGen/CGBuiltin.cpp (+17) 
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+14) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+11-4) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+102-10) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp (+14) 
- (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+16) 
- (modified) llvm/lib/Target/AMDGPU/SIPostRABundler.cpp (+2-1) 
- (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll (+25) 


``diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 61ec8b79bf054d..f7b6a4610bd80a 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
 BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
-BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n")
+BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n")
 BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 528a13fb275124..4bf71c7535db63 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_grid_size_z:
 return EmitAMDGPUGridSize(*this, 2);
 
+  // scheduling builtins
+  case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
+    return E->getNumArgs() == 3
+               ? Builder.CreateCall(
+                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2))})
+               : Builder.CreateCall(
+                     CGM.getIntrinsic(
+                         Intrinsic::amdgcn_sched_group_barrier_rule),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2)),
+                      EmitScalarExpr(E->getArg(3))});
+  }
+
   // r600 intrinsics
   case AMDGPU::BI__builtin_r600_recipsqrt_ieee:
   case AMDGPU::BI__builtin_r600_recipsqrt_ieeef:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index 8a4533633706b2..e28e0a6987484b 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -436,6 +436,20 @@ void test_sched_group_barrier()
   __builtin_amdgcn_sched_group_barrier(15, 1, -1);
 }
 
+// CHECK-LABEL: @test_sched_group_barrier_rule
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+void test_sched_group_barrier_rule()
+{
+  __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
+  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
+  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+}
+
+
 // CHECK-LABEL: @test_iglp_opt
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0)
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 051e603c0819d2..68fe42a8f04d21 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">,
 // matching instructions that will be associated with this sched_group_barrier.
 // The third parameter is an identifier which is 

[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-03-14 Thread Jeffrey Byrnes via cfe-commits

jrbyrnes wrote:

Supersedes https://github.com/llvm/llvm-project/pull/78775

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-03-14 Thread Jeffrey Byrnes via cfe-commits

https://github.com/jrbyrnes edited 
https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)

2024-03-14 Thread Jeffrey Byrnes via cfe-commits

https://github.com/jrbyrnes created 
https://github.com/llvm/llvm-project/pull/85304

I am still working with the user to define the actual rules, so it is still a 
WIP.  However, the current version contains the main machinery of the feature.

This helps bridge the gap between sched_group_barrier and iglp_opt, giving
users (with compiler support) more ability to create the pipelines they want.

In particular, this is aimed at helping control scheduling in blocks with 
loop-carried dependencies. Since this is a global scheduling problem, there is 
no straightforward way to tune the scheduler against these blocks.

>From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001
From: Jeffrey Byrnes 
Date: Tue, 12 Mar 2024 10:22:24 -0700
Subject: [PATCH] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to
 support rules.

Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   2 +-
 clang/lib/CodeGen/CGBuiltin.cpp   |  17 +++
 clang/test/CodeGenOpenCL/builtins-amdgcn.cl   |  14 +++
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  15 ++-
 llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 --
 llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp  |  14 +++
 llvm/lib/Target/AMDGPU/SIInstructions.td  |  16 +++
 llvm/lib/Target/AMDGPU/SIPostRABundler.cpp|   3 +-
 .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll |  25 
 9 files changed, 202 insertions(+), 16 deletions(-)

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 61ec8b79bf054d..f7b6a4610bd80a 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n")
 BUILTIN(__builtin_amdgcn_s_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n")
 BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n")
-BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n")
+BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n")
 BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n")
 BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n")
 BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n")
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index 528a13fb275124..4bf71c7535db63 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
   case AMDGPU::BI__builtin_amdgcn_grid_size_z:
 return EmitAMDGPUGridSize(*this, 2);
 
+  // scheduling builtins
+  case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
+    return E->getNumArgs() == 3
+               ? Builder.CreateCall(
+                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2))})
+               : Builder.CreateCall(
+                     CGM.getIntrinsic(
+                         Intrinsic::amdgcn_sched_group_barrier_rule),
+                     {EmitScalarExpr(E->getArg(0)),
+                      EmitScalarExpr(E->getArg(1)),
+                      EmitScalarExpr(E->getArg(2)),
+                      EmitScalarExpr(E->getArg(3))});
+  }
+
   // r600 intrinsics
   case AMDGPU::BI__builtin_r600_recipsqrt_ieee:
   case AMDGPU::BI__builtin_r600_recipsqrt_ieeef:
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
index 8a4533633706b2..e28e0a6987484b 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl
@@ -436,6 +436,20 @@ void test_sched_group_barrier()
   __builtin_amdgcn_sched_group_barrier(15, 1, -1);
 }
 
+// CHECK-LABEL: @test_sched_group_barrier_rule
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+void test_sched_group_barrier_rule()
+{
+  __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
+  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
+  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+}
+
+
 // CHECK-LABEL: @test_iglp_opt
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0)
 // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 051e603c0819d2..68fe42a8f04d21 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier :