[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
jrbyrnes wrote:
> We should spend more energy making the scheduler sensible by default, instead
> of creating all of this complexity.

I would also prefer a more sensible default scheduler, but the driving use case for this is global scheduling. The scheduler is doing inefficient things since it is unaware of loop-carried dependencies. A generalized solution, then, is not feasible given the timeline for that feature. We could try adding some sort of ad-hoc heuristic to the scheduler for cases like this, but I don't see how that would improve complexity relative to this, and it would likely not produce the results the users expect.

https://github.com/llvm/llvm-project/pull/85304
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -2658,21 +2676,102 @@ IGroupLPDAGMutation::invertSchedBarrierMask(SchedGroupMask Mask) const {
   return InvertedMask;
 }

+void IGroupLPDAGMutation::addSchedGroupBarrierRules() {
+
+  /// Whether or not the instruction has no true data predecessors
+  /// with opcode \p Opc.
+  class NoOpcDataPred : public InstructionRule {
+  protected:
+    unsigned Opc;
+
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SUnit *> &Cache) override {
+      return !std::any_of(
+          SU->Preds.begin(), SU->Preds.end(), [this](const SDep &Pred) {
+            return Pred.getKind() == SDep::Data &&
+                   Pred.getSUnit()->getInstr()->getOpcode() == Opc;
+          });
+    }
+
+    NoOpcDataPred(unsigned Opc, const SIInstrInfo *TII, unsigned SGID,
+                  bool NeedsCache = false)
+        : InstructionRule(TII, SGID, NeedsCache), Opc(Opc) {}
+  };
+
+  /// Whether or not the instruction has no write after read predecessors
+  /// with opcode \p Opc.
+  class NoOpcWARPred final : public InstructionRule {
+  protected:
+    unsigned Opc;
+
+  public:
+    bool apply(const SUnit *SU, const ArrayRef<SUnit *> Collection,
+               SmallVectorImpl<SUnit *> &Cache) override {
+      return !std::any_of(
+          SU->Preds.begin(), SU->Preds.end(), [this](const SDep &Pred) {
+            return Pred.getKind() == SDep::Anti &&
+                   Pred.getSUnit()->getInstr()->getOpcode() == Opc;
+          });
+    }
+    NoOpcWARPred(unsigned Opc, const SIInstrInfo *TII, unsigned SGID,
+                 bool NeedsCache = false)
+        : InstructionRule(TII, SGID, NeedsCache), Opc(Opc){};
+  };
+
+  SchedGroupBarrierRuleCallBacks = {
+      [](unsigned SGID, const SIInstrInfo *TII) {
+        return std::make_shared<NoOpcWARPred>(AMDGPU::V_CNDMASK_B32_e64, TII,

arsenm wrote: There's basically no reason to ever use shared_ptr; something is wrong if it's necessary over unique_ptr.

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/arsenm edited https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -1284,7 +1284,29 @@ The AMDGPU backend implements the following LLVM IR intrinsics.

   | ``// 5 MFMA``
   | ``__builtin_amdgcn_sched_group_barrier(8, 5, 0)``

-  llvm.amdgcn.iglp_opt  An **experimental** intrinsic for instruction group level parallelism. The intrinsic
+  llvm.amdgcn.sched.group.barrier.rule  It has the same behavior as sched.group.barrier, except the intrinsic includes a fourth argument:
+
+  - RuleMask : The bitmask of rules which are applied to the SchedGroup.
+
+  The RuleMask is handled as a 64-bit integer, so 64 rules are encodable with a single mask.
+
+  Users can access the intrinsic by specifying the optional fourth argument in the sched_group_barrier builtin:
+
+  | ``// 1 VMEM read invoking rules 1 and 2``
+  | ``__builtin_amdgcn_sched_group_barrier(32, 1, 0, 3)``
+
+  Currently available rules are:
+
+  - 0x0000: No rule.
+  - 0x0001: Instructions in the SchedGroup must not write to the same register
+    that a previously occurring V_CNDMASK_B32_e64 reads from.
+  - 0x0002: Instructions in the SchedGroup must not write to the same register
+    that a previously occurring V_PERM_B32_e64 reads from.
+  - 0x0004: Instructions in the SchedGroup must require data produced by a
+    V_CNDMASK_B32_e64.
+  - 0x0008: Instructions in the SchedGroup must require data produced by a
+    V_PERM_B32_e64.

arsenm wrote: These scheduling rules seem way too specific, especially in that they point at specific instruction encodings by their internal pseudo-instruction names.

https://github.com/llvm/llvm-project/pull/85304
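The RuleMask encoding described in the documentation above is a plain bitwise OR of the selected rule bits. A minimal host-side sketch, assuming the documented bit values (the enumerator names below are illustrative and not part of the builtin API):

```cpp
#include <cassert>
#include <cstdint>

// Names are invented for illustration; only the numeric values come from
// the documentation excerpt above.
enum SchedGroupBarrierRuleBits : uint64_t {
  NoRule = 0x0000,
  NoWARWithCndMask = 0x0001, // no WAR hazard against a prior V_CNDMASK_B32_e64
  NoWARWithPerm = 0x0002,    // no WAR hazard against a prior V_PERM_B32_e64
  NeedsCndMaskData = 0x0004, // must consume data from a V_CNDMASK_B32_e64
  NeedsPermData = 0x0008,    // must consume data from a V_PERM_B32_e64
};

// RuleMask is just the OR of the chosen rule bits.
constexpr uint64_t makeRuleMask(uint64_t A, uint64_t B) { return A | B; }
```

For example, combining the two WAR rules yields the mask 3, which matches the fourth argument in the documentation's ``__builtin_amdgcn_sched_group_barrier(32, 1, 0, 3)`` example.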
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/arsenm commented:

I don't understand how anyone is supposed to use this. This is exposing extremely specific, random low-level details of the scheduling. Users claim they want scheduling controls, but what they actually want is for the scheduler to just do the right thing. We should spend more energy making the scheduler sensible by default, instead of creating all of this complexity. If we're going to have something like this, it needs to have predefined macros instead of expecting reading

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -437,16 +437,18 @@ void test_sched_group_barrier()
 }

 // CHECK-LABEL: @test_sched_group_barrier_rule
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100)
-// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 1)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 -9223372036854775808)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 255)
+// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 1)
 void test_sched_group_barrier_rule()
 {
   __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0);
   __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0);
-  __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100);
-  __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100);
+  __builtin_amdgcn_sched_group_barrier(1, 2, 4, 63);
+  __builtin_amdgcn_sched_group_barrier(2, 4, 6, 0, 1, 2, 3, 4, 5, 6, 7);

jrbyrnes wrote: Do you prefer having the latest iteration, wherein users provide a mask instead of the variadic arguments?

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/jrbyrnes updated https://github.com/llvm/llvm-project/pull/85304 >From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001 From: Jeffrey Byrnes Date: Tue, 12 Mar 2024 10:22:24 -0700 Subject: [PATCH 1/4] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e --- clang/include/clang/Basic/BuiltinsAMDGPU.def | 2 +- clang/lib/CodeGen/CGBuiltin.cpp | 17 +++ clang/test/CodeGenOpenCL/builtins-amdgcn.cl | 14 +++ llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 15 ++- llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 -- llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp | 14 +++ llvm/lib/Target/AMDGPU/SIInstructions.td | 16 +++ llvm/lib/Target/AMDGPU/SIPostRABundler.cpp| 3 +- .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll | 25 9 files changed, 202 insertions(+), 16 deletions(-) diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 61ec8b79bf054d..f7b6a4610bd80a 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n") BUILTIN(__builtin_amdgcn_s_barrier, "v", "n") BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n") BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n") -BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n") +BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n") BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n") BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n") BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n") diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp index 528a13fb275124..4bf71c7535db63 100644 --- a/clang/lib/CodeGen/CGBuiltin.cpp +++ b/clang/lib/CodeGen/CGBuiltin.cpp @@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_grid_size_z: return EmitAMDGPUGridSize(*this, 2); + // scheduling 
builtins + case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: { +return E->getNumArgs() == 3 + ? Builder.CreateCall( + CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2))}) + : Builder.CreateCall( + CGM.getIntrinsic( + Intrinsic::amdgcn_sched_group_barrier_rule), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2)), + EmitScalarExpr(E->getArg(3))}); + } + // r600 intrinsics case AMDGPU::BI__builtin_r600_recipsqrt_ieee: case AMDGPU::BI__builtin_r600_recipsqrt_ieeef: diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl index 8a4533633706b2..e28e0a6987484b 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl @@ -436,6 +436,20 @@ void test_sched_group_barrier() __builtin_amdgcn_sched_group_barrier(15, 1, -1); } +// CHECK-LABEL: @test_sched_group_barrier_rule +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100) +void test_sched_group_barrier_rule() +{ + __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0); + __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0); + __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100); + __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100); +} + + // CHECK-LABEL: @test_iglp_opt // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0) // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1) diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 051e603c0819d2..68fe42a8f04d21 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">, // matching instructions that will be associated with this sched_group_barrier. // The third parameter is an identifier which is used to describe what other // sched_group_barriers should be synchronized with. -def int_amdgcn_sched_group_barrier : ClangBuiltin<"__builtin_amdgcn_sched_group_barrier">, - Intrinsic<[], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty], - [ImmArg>, ImmArg>, ImmArg>, IntrNoMem, IntrHasSideEffects, - IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>; +multiclass
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -18763,19 +18763,28 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,

   // scheduling builtins
   case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: {
-    return E->getNumArgs() == 3
-               ? Builder.CreateCall(
-                     CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2))})
-               : Builder.CreateCall(
-                     CGM.getIntrinsic(
-                         Intrinsic::amdgcn_sched_group_barrier_rule),
-                     {EmitScalarExpr(E->getArg(0)),
-                      EmitScalarExpr(E->getArg(1)),
-                      EmitScalarExpr(E->getArg(2)),
-                      EmitScalarExpr(E->getArg(3))});
+    if (E->getNumArgs() == 3)
+      return Builder.CreateCall(
+          CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier),
+          {EmitScalarExpr(E->getArg(0)), EmitScalarExpr(E->getArg(1)),
+           EmitScalarExpr(E->getArg(2))});
+
+    uint64_t Mask = 0;
+    for (unsigned I = 3; I < E->getNumArgs(); I++) {
+      auto NextArg = EmitScalarExpr(E->getArg(I));
+      auto ArgLiteral = cast<ConstantInt>(NextArg)->getZExtValue();
+      if (ArgLiteral > 63) {
+        CGM.Error(E->getExprLoc(),
+                  getContext().BuiltinInfo.getName(BuiltinID).str() +
+                      " RuleID must be within [0,63].");

arsenm wrote: Should such checks go in Sema instead?

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -18763,19 +18763,28 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, // scheduling builtins case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: { -return E->getNumArgs() == 3 - ? Builder.CreateCall( - CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier), - {EmitScalarExpr(E->getArg(0)), - EmitScalarExpr(E->getArg(1)), - EmitScalarExpr(E->getArg(2))}) - : Builder.CreateCall( - CGM.getIntrinsic( - Intrinsic::amdgcn_sched_group_barrier_rule), - {EmitScalarExpr(E->getArg(0)), - EmitScalarExpr(E->getArg(1)), - EmitScalarExpr(E->getArg(2)), - EmitScalarExpr(E->getArg(3))}); +if (E->getNumArgs() == 3) arsenm wrote: Braces https://github.com/llvm/llvm-project/pull/85304 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -0,0 +1,10 @@
+// RUN: %clang_cc1 -O0 -cl-std=CL2.0 -triple amdgcn-amd-amdhsa -target-cpu gfx90a \
+// RUN:   -verify -S -o - %s
+

arsenm wrote: -verify tests belong in Sema; there's no codegen here.

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -2747,23 +2749,32 @@ void IGroupLPDAGMutation::initSchedGroupBarrierPipelineStage(
   int32_t SGMask = SGB.getOperand(0).getImm();
   int32_t Size = SGB.getOperand(1).getImm();
   int32_t SyncID = SGB.getOperand(2).getImm();
-  std::optional<int64_t> RuleID =
+  std::optional<int64_t> RuleMask =
       (SGB.getOpcode() == AMDGPU::SCHED_GROUP_BARRIER_RULE)
           ? SGB.getOperand(3).getImm()
           : std::optional<int64_t>(std::nullopt);

-  // Sanitize the input
-  if (RuleID && (!SchedGroupBarrierRuleCallBacks.size() ||
-                 *RuleID > (int)(SchedGroupBarrierRuleCallBacks.size() - 1))) {
-    RuleID = std::nullopt;
-    llvm_unreachable("Bad rule ID!");
-  }
-
   auto SG = &SyncedSchedGroups[SyncID].emplace_back((SchedGroupMask)SGMask, Size, SyncID, DAG, TII);

-  if (RuleID)
-    SG->addRule(SchedGroupBarrierRuleCallBacks[*RuleID](SG->getSGID()));
+  // Process the input mask
+  if (RuleMask) {
+    uint64_t TheMask = *RuleMask;
+    unsigned NextID = 0;
+    while (TheMask) {
+      if (!(TheMask & 0x1)) {
+        TheMask >>= 1;
+        ++NextID;
+        continue;
+      }
+      if ((!SchedGroupBarrierRuleCallBacks.size() ||
+           NextID > SchedGroupBarrierRuleCallBacks.size() - 1))

arsenm wrote: The !size() check is redundant?

https://github.com/llvm/llvm-project/pull/85304
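The mask walk quoted in that hunk can be reduced to a single bounds comparison, which is the point of the review remark: for an empty callback vector, `size()` is 0 and `NextID < size()` is already false, so the separate `!size()` test adds nothing. A standalone sketch of the same loop (illustrative, not the patch's code; out-of-range bits are simply dropped here rather than reported):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Walk the set bits of a 64-bit rule mask and collect the rule IDs that
// fall within [0, NumRules). Each set bit k selects rule ID k.
std::vector<unsigned> decodeRuleMask(uint64_t Mask, std::size_t NumRules) {
  std::vector<unsigned> IDs;
  unsigned NextID = 0;
  while (Mask) {
    if ((Mask & 0x1) && NextID < NumRules) // one check covers the empty case
      IDs.push_back(NextID);
    Mask >>= 1;
    ++NextID;
  }
  return IDs;
}
```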
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
@@ -437,16 +437,18 @@ void test_sched_group_barrier() } // CHECK-LABEL: @test_sched_group_barrier_rule -// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0) -// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0) -// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100) -// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i64 1) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 1) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i64 -9223372036854775808) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 255) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 2, i32 4, i32 6, i64 1) void test_sched_group_barrier_rule() { __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0); __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0); - __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100); - __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100); + __builtin_amdgcn_sched_group_barrier(1, 2, 4, 63); + __builtin_amdgcn_sched_group_barrier(2, 4, 6, 0, 1, 2, 3, 4, 5, 6, 7); arsenm wrote: I have no idea what all these numbers are supposed to mean https://github.com/llvm/llvm-project/pull/85304 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
jrbyrnes wrote: Updated the PR as discussed offline. Support the variadic builtin arguments by combining them into a mask for the intrinsic. This implies a limit of 64 rules, but we can work around that by adding a new intrinsic with two masks (to support rules 65-128), and so on. For now, rules in this PR behave as they do in the existing code (that is, they are additional inclusion criteria). Any changes to this will be addressed in a future PR.

https://github.com/llvm/llvm-project/pull/85304
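The variadic-to-mask combining described here can be sketched as follows (a host-side illustration under the stated encoding, not the clang lowering itself): each rule ID in [0, 63] contributes bit `1 << ID` to a single i64 mask.

```cpp
#include <cstdint>
#include <initializer_list>
#include <stdexcept>

// Fold a list of rule IDs into the single 64-bit RuleMask operand.
// IDs outside [0, 63] are rejected, mirroring the diagnostic in the patch.
uint64_t combineRuleIDs(std::initializer_list<unsigned> IDs) {
  uint64_t Mask = 0;
  for (unsigned ID : IDs) {
    if (ID > 63)
      throw std::out_of_range("RuleID must be within [0,63]");
    Mask |= 1ull << ID;
  }
  return Mask;
}
```

This also explains the CHECK lines in the updated test: IDs 0 through 7 fold to `i64 255`, and ID 63 alone folds to `1 << 63`, which prints as the signed value `-9223372036854775808`.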
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/jrbyrnes ready_for_review https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/jrbyrnes updated https://github.com/llvm/llvm-project/pull/85304 >From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001 From: Jeffrey Byrnes Date: Tue, 12 Mar 2024 10:22:24 -0700 Subject: [PATCH 1/2] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e --- clang/include/clang/Basic/BuiltinsAMDGPU.def | 2 +- clang/lib/CodeGen/CGBuiltin.cpp | 17 +++ clang/test/CodeGenOpenCL/builtins-amdgcn.cl | 14 +++ llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 15 ++- llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 -- llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp | 14 +++ llvm/lib/Target/AMDGPU/SIInstructions.td | 16 +++ llvm/lib/Target/AMDGPU/SIPostRABundler.cpp| 3 +- .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll | 25 9 files changed, 202 insertions(+), 16 deletions(-) diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 61ec8b79bf054d..f7b6a4610bd80a 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n") BUILTIN(__builtin_amdgcn_s_barrier, "v", "n") BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n") BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n") -BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n") +BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n") BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n") BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n") BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n") diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp index 528a13fb275124..4bf71c7535db63 100644 --- a/clang/lib/CodeGen/CGBuiltin.cpp +++ b/clang/lib/CodeGen/CGBuiltin.cpp @@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_grid_size_z: return EmitAMDGPUGridSize(*this, 2); + // scheduling 
builtins + case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: { +return E->getNumArgs() == 3 + ? Builder.CreateCall( + CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2))}) + : Builder.CreateCall( + CGM.getIntrinsic( + Intrinsic::amdgcn_sched_group_barrier_rule), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2)), + EmitScalarExpr(E->getArg(3))}); + } + // r600 intrinsics case AMDGPU::BI__builtin_r600_recipsqrt_ieee: case AMDGPU::BI__builtin_r600_recipsqrt_ieeef: diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl index 8a4533633706b2..e28e0a6987484b 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl @@ -436,6 +436,20 @@ void test_sched_group_barrier() __builtin_amdgcn_sched_group_barrier(15, 1, -1); } +// CHECK-LABEL: @test_sched_group_barrier_rule +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100) +void test_sched_group_barrier_rule() +{ + __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0); + __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0); + __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100); + __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100); +} + + // CHECK-LABEL: @test_iglp_opt // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0) // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1) diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 051e603c0819d2..68fe42a8f04d21 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td 
@@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">, // matching instructions that will be associated with this sched_group_barrier. // The third parameter is an identifier which is used to describe what other // sched_group_barriers should be synchronized with. -def int_amdgcn_sched_group_barrier : ClangBuiltin<"__builtin_amdgcn_sched_group_barrier">, - Intrinsic<[], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty], - [ImmArg>, ImmArg>, ImmArg>, IntrNoMem, IntrHasSideEffects, - IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>; +multiclass
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
llvmbot wrote: @llvm/pr-subscribers-backend-amdgpu Author: Jeffrey Byrnes (jrbyrnes) Changes I am still working with the user to define the actual rules, so it is still a WIP. However, this current version contains the main machinery of the feature. This helps bridge the gap between sched_group_barrier and iglp_opt, enabling users (with compiler support) more ability to create the pipelines they want. In particular, this is aimed at helping control scheduling in blocks with loop-carried dependencies. Since this is a global scheduling problem, there is no straightforward way to tune the scheduler against these blocks. --- Full diff: https://github.com/llvm/llvm-project/pull/85304.diff 9 Files Affected: - (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1-1) - (modified) clang/lib/CodeGen/CGBuiltin.cpp (+17) - (modified) clang/test/CodeGenOpenCL/builtins-amdgcn.cl (+14) - (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+11-4) - (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+102-10) - (modified) llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp (+14) - (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+16) - (modified) llvm/lib/Target/AMDGPU/SIPostRABundler.cpp (+2-1) - (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.sched.group.barrier.ll (+25) ``diff diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 61ec8b79bf054d..f7b6a4610bd80a 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n") BUILTIN(__builtin_amdgcn_s_barrier, "v", "n") BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n") BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n") -BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n") +BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n") BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n") BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n") 
BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n") diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp index 528a13fb275124..4bf71c7535db63 100644 --- a/clang/lib/CodeGen/CGBuiltin.cpp +++ b/clang/lib/CodeGen/CGBuiltin.cpp @@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_grid_size_z: return EmitAMDGPUGridSize(*this, 2); + // scheduling builtins + case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: { +return E->getNumArgs() == 3 + ? Builder.CreateCall( + CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2))}) + : Builder.CreateCall( + CGM.getIntrinsic( + Intrinsic::amdgcn_sched_group_barrier_rule), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2)), + EmitScalarExpr(E->getArg(3))}); + } + // r600 intrinsics case AMDGPU::BI__builtin_r600_recipsqrt_ieee: case AMDGPU::BI__builtin_r600_recipsqrt_ieeef: diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl index 8a4533633706b2..e28e0a6987484b 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl @@ -436,6 +436,20 @@ void test_sched_group_barrier() __builtin_amdgcn_sched_group_barrier(15, 1, -1); } +// CHECK-LABEL: @test_sched_group_barrier_rule +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100) +void test_sched_group_barrier_rule() +{ + __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0); + __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0); + __builtin_amdgcn_sched_group_barrier(4, 8, 
16, 100); + __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100); +} + + // CHECK-LABEL: @test_iglp_opt // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0) // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1) diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 051e603c0819d2..68fe42a8f04d21 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier : ClangBuiltin<"__builtin_amdgcn_sched_barrier">, // matching instructions that will be associated with this sched_group_barrier. // The third parameter is an identifier which is
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
jrbyrnes wrote: Supersedes https://github.com/llvm/llvm-project/pull/78775

https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/jrbyrnes edited https://github.com/llvm/llvm-project/pull/85304
[clang] [llvm] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. (PR #85304)
https://github.com/jrbyrnes created https://github.com/llvm/llvm-project/pull/85304 I am still working with the user to define the actual rules, so it is still a WIP. However, the current version contains the main machinery of the feature. This helps bridge the gap between sched_group_barrier and iglp_opt, enabling users (with compiler support) more ability to create the pipelines they want. In particular, this is aimed at helping control scheduling in blocks with loop-carried dependencies. Since this is a global scheduling problem, there is no straightforward way to tune the scheduler against these blocks. >From 04dc59ff7757dea18e2202d1cbff1d675885fdae Mon Sep 17 00:00:00 2001 From: Jeffrey Byrnes Date: Tue, 12 Mar 2024 10:22:24 -0700 Subject: [PATCH] [AMDGPU] Extend __builtin_amdgcn_sched_group_barrier to support rules. Change-Id: Id8460dc42f41575760793c0fc70e0bc0aecc0d5e --- clang/include/clang/Basic/BuiltinsAMDGPU.def | 2 +- clang/lib/CodeGen/CGBuiltin.cpp | 17 +++ clang/test/CodeGenOpenCL/builtins-amdgcn.cl | 14 +++ llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 15 ++- llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp | 112 -- llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp | 14 +++ llvm/lib/Target/AMDGPU/SIInstructions.td | 16 +++ llvm/lib/Target/AMDGPU/SIPostRABundler.cpp| 3 +- .../AMDGPU/llvm.amdgcn.sched.group.barrier.ll | 25 9 files changed, 202 insertions(+), 16 deletions(-) diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 61ec8b79bf054d..f7b6a4610bd80a 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -63,7 +63,7 @@ BUILTIN(__builtin_amdgcn_s_sendmsghalt, "vIiUi", "n") BUILTIN(__builtin_amdgcn_s_barrier, "v", "n") BUILTIN(__builtin_amdgcn_wave_barrier, "v", "n") BUILTIN(__builtin_amdgcn_sched_barrier, "vIi", "n") -BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi", "n") +BUILTIN(__builtin_amdgcn_sched_group_barrier, "vIiIiIi.", "n") 
BUILTIN(__builtin_amdgcn_iglp_opt, "vIi", "n") BUILTIN(__builtin_amdgcn_s_dcache_inv, "v", "n") BUILTIN(__builtin_amdgcn_buffer_wbinvl1, "v", "n") diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp index 528a13fb275124..4bf71c7535db63 100644 --- a/clang/lib/CodeGen/CGBuiltin.cpp +++ b/clang/lib/CodeGen/CGBuiltin.cpp @@ -18761,6 +18761,23 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, case AMDGPU::BI__builtin_amdgcn_grid_size_z: return EmitAMDGPUGridSize(*this, 2); + // scheduling builtins + case AMDGPU::BI__builtin_amdgcn_sched_group_barrier: { +return E->getNumArgs() == 3 + ? Builder.CreateCall( + CGM.getIntrinsic(Intrinsic::amdgcn_sched_group_barrier), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2))}) + : Builder.CreateCall( + CGM.getIntrinsic( + Intrinsic::amdgcn_sched_group_barrier_rule), + {EmitScalarExpr(E->getArg(0)), + EmitScalarExpr(E->getArg(1)), + EmitScalarExpr(E->getArg(2)), + EmitScalarExpr(E->getArg(3))}); + } + // r600 intrinsics case AMDGPU::BI__builtin_r600_recipsqrt_ieee: case AMDGPU::BI__builtin_r600_recipsqrt_ieeef: diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl index 8a4533633706b2..e28e0a6987484b 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn.cl @@ -436,6 +436,20 @@ void test_sched_group_barrier() __builtin_amdgcn_sched_group_barrier(15, 1, -1); } +// CHECK-LABEL: @test_sched_group_barrier_rule +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 0, i32 1, i32 2, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 1, i32 2, i32 4, i32 0) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 4, i32 8, i32 16, i32 100) +// CHECK: call void @llvm.amdgcn.sched.group.barrier.rule(i32 15, i32 1, i32 -1, i32 -100) +void test_sched_group_barrier_rule() +{ + __builtin_amdgcn_sched_group_barrier(0, 1, 2, 0); 
+ __builtin_amdgcn_sched_group_barrier(1, 2, 4, 0); + __builtin_amdgcn_sched_group_barrier(4, 8, 16, 100); + __builtin_amdgcn_sched_group_barrier(15, 1, -1, -100); +} + + // CHECK-LABEL: @test_iglp_opt // CHECK: call void @llvm.amdgcn.iglp.opt(i32 0) // CHECK: call void @llvm.amdgcn.iglp.opt(i32 1) diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td index 051e603c0819d2..68fe42a8f04d21 100644 --- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td +++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td @@ -297,10 +297,17 @@ def int_amdgcn_sched_barrier :