[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
https://github.com/arsenm commented:

Should lose the [WIP] in the title.

https://github.com/llvm/llvm-project/pull/89217

vikramRH wrote:

> > 1. What's the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> Fptrunc" is wrong)
>
> Bitcast to i16, anyext to i32, laneop, trunc to i16, bitcast to original type. Why wouldn't you use bitcasts?

Just a doubt I had about the previous comments, sorry for the noise!

```
@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
                      DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&DAG, &SL](SDValue Src0, SDValue Src1, SDValue Src2,
+                                  MVT VT) -> SDValue {
+    return (Src2 ? DAG.getNode(AMDGPUISD::WRITELANE, SL, VT, {Src0, Src1, Src2})
+            : Src1 ? DAG.getNode(AMDGPUISD::READLANE, SL, VT, {Src0, Src1})
+                   : DAG.getNode(AMDGPUISD::READFIRSTLANE, SL, VT, {Src0}));
+  };
+
+  SDValue Src1, Src2;
+  if (IntrinsicID == Intrinsic::amdgcn_readlane ||
+      IntrinsicID == Intrinsic::amdgcn_writelane) {
+    Src1 = N->getOperand(2);
+    if (IntrinsicID == Intrinsic::amdgcn_writelane)
+      Src2 = N->getOperand(3);
+  }
+
+  if (ValSize == 32) {
+    // Already legal
+    return SDValue();
+  }
+
+  if (ValSize < 32) {
+    SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+    Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+    if (Src2.getNode()) {
+      SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);
```

arsenm wrote:

Yes, bitcast for the f16/bf16 case to get to the int.

```
@@ -5456,43 +5444,32 @@ bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
   if ((Size % 32) == 0) {
     SmallVector<Register, 2> PartialRes;
     unsigned NumParts = Size / 32;
-    auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+    bool IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
```

arsenm wrote:

Better to track this as the LLT to use for the pieces, rather than making it this conditional thing. This will simplify improved pointer handling in the future.
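
[Editorial note: a minimal sketch of the direction being suggested, assuming the surrounding `legalizeLaneOp` context (`Ty`, `B`, `Src0`) from the patch; the name `PartTy` is made up here and is not from the PR.]

```
// Compute the LLT to use for the pieces once, instead of re-checking
// IsS16Vec at every use site. Vectors of 16-bit elements are unmerged
// as v2s16 pieces, everything else as s32 pieces.
LLT PartTy = (Ty.isVector() && Ty.getElementType() == LLT::scalar(16))
                 ? LLT::fixed_vector(2, 16)
                 : LLT::scalar(32);
auto Src0Parts = B.buildUnmerge(PartTy, Src0);
```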

jayfoad wrote:

> 1. What's the proper way to legalize f16 and bf16 for SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> Fptrunc" is wrong)

Bitcast to i16, anyext to i32, laneop, trunc to i16, bitcast to original type. Why wouldn't you use bitcasts?
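
[Editorial note: spelled out against the patch's `lowerLaneOp`, the suggested chain for an f16/bf16 source looks roughly like this. A sketch only; it assumes `Src0`, `Src1`, `Src2`, `SL`, `VT`, and the `createLaneOp` helper from the quoted hunk are in scope.]

```
// bitcast f16 -> i16, any-extend to the legal 32-bit lane width,
// run the lane op, then truncate and bitcast back.
// (For writelane, Src2 needs the same bitcast/extend first.)
SDValue Cast  = DAG.getBitcast(MVT::i16, Src0);
SDValue Ext   = DAG.getAnyExtOrTrunc(Cast, SL, MVT::i32);
SDValue Lane  = createLaneOp(Ext, Src1, Src2, MVT::i32);
SDValue Trunc = DAG.getAnyExtOrTrunc(Lane, SL, MVT::i16);
SDValue Res   = DAG.getBitcast(VT, Trunc);
```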

vikramRH wrote:

Updated the GISel legalizer. I still have a couple of questions for the SDAG case though:

1. What's the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> Fptrunc" is wrong.)
2. For scalar cases such as i64, f64, i128, ... (i.e. 32-bit multiples), I guess a bitcast to vectors (v2i32, v2f32, v4i32) is unavoidable, since "UnrollVectorOp" wouldn't work otherwise. Any alternate suggestions here? (A sketch of this route follows below.)
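
[Editorial note: for question 2, the bitcast-to-vector route being asked about would look roughly like this. A sketch against the patch's `lowerLaneOp` context, shown for readlane/readfirstlane only; writelane's `Src2` would need the same split. Whether this casting is acceptable is exactly what the question is about.]

```
// Reinterpret a 32-bit-multiple scalar (i64, f64, i128, ...) as <N x i32>,
// lane-op each 32-bit piece, then rebuild the original type.
unsigned NumParts = ValSize / 32;
MVT VecVT = MVT::getVectorVT(MVT::i32, NumParts);
SDValue Vec = DAG.getBitcast(VecVT, Src0);
SmallVector<SDValue, 4> Pieces;
for (unsigned I = 0; I != NumParts; ++I) {
  SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, MVT::i32, Vec,
                            DAG.getVectorIdxConstant(I, SL));
  Pieces.push_back(createLaneOp(Elt, Src1, SDValue(), MVT::i32));
}
SDValue Res = DAG.getBitcast(VT, DAG.getBuildVector(VecVT, SL, Pieces));
```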

```
@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+                          Register Src2) -> Register {
+    auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+    switch (IID) {
+    case Intrinsic::amdgcn_readfirstlane:
+      return LaneOp.getReg(0);
+    case Intrinsic::amdgcn_readlane:
+      return LaneOp.addUse(Src1).getReg(0);
+    case Intrinsic::amdgcn_writelane:
+      return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+    default:
+      llvm_unreachable("unhandled lane op");
+    }
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) {
+    Src1 = MI.getOperand(3).getReg();
+    if (IID == Intrinsic::amdgcn_writelane) {
+      Src2 = MI.getOperand(4).getReg();
+    }
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+    // Already legal
+    return true;
+  }
+
+  if (Size < 32) {
+    Register Src0Cast = MRI.getType(Src0).isScalar()
+                            ? Src0
+                            : B.buildBitcast(LLT::scalar(Size), Src0).getReg(0);
+    Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+    if (Src2.isValid()) {
+      Register Src2Cast =
+          MRI.getType(Src2).isScalar()
+              ? Src2
+              : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+      Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+    }
+
+    Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+    if (Ty.isScalar())
+      B.buildTrunc(DstReg, LaneOpDst);
+    else {
+      auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+      B.buildBitcast(DstReg, Trunc);
+    }
+
+    MI.eraseFromParent();
+    return true;
+  }
+
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+    MachineInstrBuilder Src0Parts;
+
+    if (Ty.isPointer()) {
+      auto PtrToInt = B.buildPtrToInt(LLT::scalar(Size), Src0);
+      Src0Parts = B.buildUnmerge(S32, PtrToInt);
+    } else if (Ty.isPointerVector()) {
+      LLT IntVecTy = Ty.changeElementType(
+          LLT::scalar(Ty.getElementType().getSizeInBits()));
+      auto PtrToInt = B.buildPtrToInt(IntVecTy, Src0);
+      Src0Parts = B.buildUnmerge(S32, PtrToInt);
+    } else
+      Src0Parts =
+          IsS16Vec ? B.buildUnmerge(V2S16, Src0) : B.buildUnmerge(S32, Src0);
+
+    switch (IID) {
+    case Intrinsic::amdgcn_readlane: {
+      Register Src1 = MI.getOperand(3).getReg();
+      for (unsigned i = 0; i < NumParts; ++i) {
+        Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+                        : Src0Parts.getReg(i);
+        PartialRes.push_back(
+            (B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+                 .addUse(Src0)
+                 .addUse(Src1))
+                .getReg(0));
+      }
+      break;
+    }
+    case Intrinsic::amdgcn_readfirstlane: {
+      for (unsigned i = 0; i < NumParts; ++i) {
+        Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+                        : Src0Parts.getReg(i);
+        PartialRes.push_back(
+            (B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
+                 .addUse(Src0)
+                 .getReg(0)));
+      }
+
+      break;
+    }
+    case Intrinsic::amdgcn_writelane: {
+      Register Src1 = MI.getOperand(3).getReg();
+      Register Src2 = MI.getOperand(4).getReg();
+      MachineInstrBuilder Src2Parts;
+
+      if (Ty.isPointer()) {
+        auto PtrToInt = B.buildPtrToInt(S64, Src2);
+        Src2Parts = B.buildUnmerge(S32, PtrToInt);
+      } else if (Ty.isPointerVector()) {
+        LLT IntVecTy = Ty.changeElementType(
+            LLT::scalar(Ty.getElementType().getSizeInBits()));
+        auto PtrToInt = B.buildPtrToInt(IntVecTy, Src2);
+        Src2Parts = B.buildUnmerge(S32, PtrToInt);
+      } else
+        Src2Parts =
+            IsS16Vec ? B.buildUnmerge(V2S16, Src2) : B.buildUnmerge(S32, Src2);
+
+      for (unsigned i = 0; i < NumParts; ++i) {
+        Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+                        : Src0Parts.getReg(i);
+        Src2 = IsS16Vec ? B.buildBitcast(S32, Src2Parts.getReg(i)).getReg(0)
+                        : Src2Parts.getReg(i);
+        PartialRes.push_back(
+            (B.buildIntrinsic(Intrinsic::amdgcn_writelane, {S32})
+                 .addUse(Src0)
```

vikramRH wrote:

done

```
@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   ...
+  if (ValSize < 32) {
+    SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+    Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+    if (Src2.getNode()) {
+      SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);
```

vikramRH wrote:

What would be the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I'm currently thinking "fp_extend -> LaneOp -> Fptrunc", which seems wrong.)

```
@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
```

arsenm wrote:

no auto

```
@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+    case Intrinsic::amdgcn_writelane: {
+      ...
+      if (Ty.isPointer()) {
+        auto PtrToInt = B.buildPtrToInt(S64, Src2);
+        Src2Parts = B.buildUnmerge(S32, PtrToInt);
+      } else if (Ty.isPointerVector()) {
+        LLT IntVecTy = Ty.changeElementType(
+            LLT::scalar(Ty.getElementType().getSizeInBits()));
+        auto PtrToInt = B.buildPtrToInt(IntVecTy, Src2);
+        Src2Parts = B.buildUnmerge(S32, PtrToInt);
+      } else
+        Src2Parts =
+            IsS16Vec ? B.buildUnmerge(V2S16, Src2) : B.buildUnmerge(S32, Src2);
```

arsenm wrote:

The point of splitting out the pointer typed tests was to avoid fixing the handling of the pointer typed selection patterns. You still have the casts inserted here. You should not need any ptrtoint, inttoptr, or bitcasts in any of these legalizations.

https://github.com/arsenm requested changes to this pull request.

There should be no need to introduce same-sized value casts, whether bitcast or ptrtoint, in either legalizer.

```
@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   ...
+  if (ValSize < 32) {
+    SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+    Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+    if (Src2.getNode()) {
+      SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);
```

arsenm wrote:

You should not have any bitcasts anywhere.

```
@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+    case Intrinsic::amdgcn_readfirstlane: {
+      for (unsigned i = 0; i < NumParts; ++i) {
+        Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
```

arsenm wrote:

No bitcasts

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", VOP_READFIRSTLANE,
-                                       getVOP1Pat<int_amdgcn_readfirstlane, VOP_READFIRSTLANE>.ret, 1> {
+                                       [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

vikramRH wrote:

Done

```
@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   ...
+  auto createLaneOp = [&](SDValue Src0, SDValue Src1, SDValue Src2,
+                          MVT VT) -> SDValue {
```

arsenm wrote:

? VT shadow

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

arsenm wrote:

I'd rather just leave the pointer cases failing in the global isel case, and remove all the cast insertion bits. You could split out the pointer cases to a separate file and just not run it with globalisel.

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

vikramRH wrote:

Do you think these changes are okay until I figure out the root cause?

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

arsenm wrote:

I think GlobalISelEmitter is just ignoring PtrValueType

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

arsenm wrote:

Seems like this is just a GlobalISelEmitter bug (and another example of why we should have a way to write patterns that only care about the bit size).

```
@@ -342,6 +342,22 @@ def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
 def AMDGPUperm_impl : SDNode<"AMDGPUISD::PERM", AMDGPUDTIntTernaryOp, []>;
 
+def AMDGPUReadfirstlaneOp : SDTypeProfile<1, 1, [
+  SDTCisSameAs<0, 1>
+]>;
+
+def AMDGPUReadlaneOp : SDTypeProfile<1, 2, [
+  SDTCisSameAs<0, 1>, SDTCisInt<2>
+]>;
+
+def AMDGPUDWritelaneOp : SDTypeProfile<1, 3, [
+  SDTCisSameAs<1, 1>, SDTCisInt<2>, SDTCisSameAs<0, 3>,
```

vikramRH wrote:

Thanks for pointing this out; I missed updating this in the latest version. Updated now. However, the issue is not related to this.

```
@@ -342,6 +342,22 @@ def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
   ...
+def AMDGPUDWritelaneOp : SDTypeProfile<1, 3, [
+  SDTCisSameAs<1, 1>, SDTCisInt<2>, SDTCisSameAs<0, 3>,
```

rovka wrote:

<0,1>

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

vikramRH wrote:

Attaching example match table snippets for v2i16 and p3 here, which should make the scenario a bit clearer. For v2i16:

```
GIM_Try, /*On fail goto*//*Label 3499*/ GIMT_Encode4(202699), // Rule ID 2117 //
  GIM_CheckIntrinsicID, /*MI*/0, /*Op*/1, GIMT_Encode2(Intrinsic::amdgcn_writelane),
  GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_v2s16,
  GIM_RootCheckType, /*Op*/2, /*Type*/GILLT_v2s16,
  GIM_RootCheckType, /*Op*/3, /*Type*/GILLT_s32,
  GIM_RootCheckType, /*Op*/4, /*Type*/GILLT_v2s16,
  GIM_RootCheckRegBankForClass, /*Op*/0, /*RC*/GIMT_Encode2(AMDGPU::VGPR_32RegClassID),
  // (intrinsic_wo_chain:{ *:[v2i16] } 2863:{ *:[iPTR] }, v2i16:{ *:[v2i16] }:$src0, i32:{ *:[i32] }:$src1, v2i16:{ *:[v2i16] }:$src2)  =>  (V_WRITELANE_B32:{ *:[v2i16] } SCSrc_b32:{ *:[v2i16] }:$src0, SCSrc_b32:{ *:[i32] }:$src1, VGPR_32:{ *:[v2i16] }:$src2)
  GIR_BuildRootMI, /*Opcode*/GIMT_Encode2(AMDGPU::V_WRITELANE_B32),
```

and for p3:

```
GIM_Try, /*On fail goto*//*Label 3502*/ GIMT_Encode4(202816), // Rule ID 2129 //
  GIM_CheckIntrinsicID, /*MI*/0, /*Op*/1, GIMT_Encode2(Intrinsic::amdgcn_writelane),
  GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_s32,
  GIM_RootCheckType, /*Op*/2, /*Type*/GILLT_p2s32,
  GIM_RootCheckType, /*Op*/3, /*Type*/GILLT_s32,
  GIM_RootCheckType, /*Op*/4, /*Type*/GILLT_p2s32,
  GIM_RootCheckRegBankForClass, /*Op*/0, /*RC*/GIMT_Encode2(AMDGPU::VGPR_32RegClassID),
  // (intrinsic_wo_chain:{ *:[i32] } 2863:{ *:[iPTR] }, p2:{ *:[i32] }:$src0, i32:{ *:[i32] }:$src1, p2:{ *:[i32] }:$src2)  =>  (V_WRITELANE_B32:{ *:[i32] } SCSrc_b32:{ *:[i32] }:$src0, SCSrc_b32:{ *:[i32] }:$src1, VGPR_32:{ *:[i32] }:$src2)
  GIR_BuildRootMI, /*Opcode*/GIMT_Encode2(AMDGPU::V_WRITELANE_B32),
```

The destination type check for the p3 case is still for "GILLT_s32".

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

arsenm wrote:

What is the problem?

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

vikramRH wrote:

Unfortunately no; I had tried this and a couple of other variations. The issue seems to be too specific to GISel pointers.

```
@@ -780,14 +780,22 @@ defm V_SUBREV_U32 : VOP2Inst <"v_subrev_u32", VOP_I32_I32_I32_ARITH, null_frag,
 // These are special and do not read the exec mask.
 let isConvergent = 1, Uses = [] in {
-def V_READLANE_B32 : VOP2_Pseudo<"v_readlane_b32", VOP_READLANE,
-  [(set i32:$vdst, (int_amdgcn_readlane i32:$src0, i32:$src1))]>;
+def V_READLANE_B32 : VOP2_Pseudo<"v_readlane_b32", VOP_READLANE, []>;
 
 let IsNeverUniform = 1, Constraints = "$vdst = $vdst_in", DisableEncoding="$vdst_in" in {
-def V_WRITELANE_B32 : VOP2_Pseudo<"v_writelane_b32", VOP_WRITELANE,
-  [(set i32:$vdst, (int_amdgcn_writelane i32:$src0, i32:$src1, i32:$vdst_in))]>;
+def V_WRITELANE_B32 : VOP2_Pseudo<"v_writelane_b32", VOP_WRITELANE, []>;
 } // End IsNeverUniform, $vdst = $vdst_in, DisableEncoding $vdst_in
 } // End isConvergent = 1
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadlane vt:$src0, i32:$src1)),
+    (V_READLANE_B32 VRegOrLdsSrc_32:$src0, SCSrc_b32:$src1)
```

arsenm wrote:

Same here, supply the type to the result instruction.

```
@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, untyped]> {
   ...
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+    (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))
```

arsenm wrote:

```suggestion
    (vt (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0)))
```

This may fix your pattern type deduction issue.

```
@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
```

vikramRH wrote:

Also, the issue is only for pointer types; float, v2i16, etc. work just fine.

```
@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
```

vikramRH wrote:

Done, except for pointers. I currently see an issue where pattern type inference somehow deduces the destination type to scalars (instead of, say, p3s32). Not currently sure why; any ideas?

```
@@ -6086,6 +6086,68 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   ...
+  if (ValSize == 32) {
+    if (VT == MVT::i32)
+      // Already legal
+      return SDValue();
+    Src0 = DAG.getBitcast(IntVT, Src0);
```

arsenm wrote:

Like the other cases, we should be able to avoid intermediate casting.
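
[Editorial note: presumably this means the whole ValSize == 32 case can report the node as already legal once per-type selection patterns exist (the Reg32Types foreach patterns elsewhere in this thread). A sketch under that assumption, not the final patch:]

```
if (ValSize == 32) {
  // i32, f32, v2i16, v2f16, 32-bit pointers, ...: the per-type GCNPats
  // select these directly, so no getBitcast round-trip is needed here.
  return SDValue(); // already legal
}
```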

```
@@ -3400,7 +3400,7 @@ def : GCNPat<
 // FIXME: Should also do this for readlane, but tablegen crashes on
 // the ignored src1.
 def : GCNPat<
-  (int_amdgcn_readfirstlane (i32 imm:$src)),
+  (i32 (AMDGPUreadfirstlane (i32 imm:$src))),
```

arsenm wrote:

We might need to make this fold more sophisticated for other types, but best for a follow-up patch.

```
@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
```

arsenm wrote:

Either add braces or don't put the comment under the if. Also, the size 32 case should be treated as directly legal for 32-bit pointers.

```
@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    auto IsPtr = Ty.isPointer();
+    Src0 = IsPtr ? B.buildPtrToInt(S32, Src0).getReg(0)
+                 : B.buildBitcast(S32, Src0).getReg(0);
```

arsenm wrote:

You should not need these casts. Any legal type should be directly accepted without these.
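
[Editorial note: i.e., roughly the cast-free shape below; a sketch assuming the per-type selection patterns accept any legal 32-bit type, not code from the PR.]

```
if (Size == 32) {
  // s32, 32-bit pointers, v2s16, ...: leave the original registers alone;
  // no ptrtoint/bitcast, the instruction selector handles the type.
  return true; // already legal
}
```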

```
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) {
+    Src1 = MI.getOperand(3).getReg();
+    if (IID == Intrinsic::amdgcn_writelane) {
+      Src2 = MI.getOperand(4).getReg();
+    }
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {
+    case Intrinsic::amdgcn_readfirstlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+      break;
+    }
+    case Intrinsic::amdgcn_readlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+      break;
+    }
+    case Intrinsic::amdgcn_writelane: {
+      Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+      LaneOpDst = B.buildIntrinsic(IID, {S32})
+                      .addUse(Src0Valid)
+                      .addUse(Src1)
+                      .addUse(Src2Valid);
+    }
+    }
+
+    Register LaneOpDstReg = LaneOpDst.getReg(0);
+    B.buildBitcast(DstReg, LaneOpDstReg);
+    MI.eraseFromParent();
+    return true;
+  }
+
+  if (Size < 32) {
+    Register Src0Cast = MRI.getType(Src0).isScalar()
+                            ? Src0
+                            : B.buildBitcast(LLT::scalar(Size), Src0).getReg(0);
+    Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {
+    case Intrinsic::amdgcn_readfirstlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+      break;
+    }
+    case Intrinsic::amdgcn_readlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+      break;
+    }
+    case Intrinsic::amdgcn_writelane: {
+      Register Src2Cast =
+          MRI.getType(Src2).isScalar()
+              ? Src2
+              : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+      Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+      LaneOpDst = B.buildIntrinsic(IID, {S32})
+                      .addUse(Src0Valid)
+                      .addUse(Src1)
+                      .addUse(Src2Valid);
+    }
+    }
+
+    Register LaneOpDstReg = LaneOpDst.getReg(0);
+    if (Ty.isScalar())
+      B.buildTrunc(DstReg, LaneOpDstReg);
+    else {
+      auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+      B.buildBitcast(DstReg, Trunc);
+    }
+
+    MI.eraseFromParent();
+    return true;
+  }
+
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto Src0Parts = B.buildUnmerge(S32, Src0);
```

arsenm wrote:

You avoid adding extra intermediate bitcasts.

vikramRH wrote:

Added new 32-bit pointer and <8 x i16> tests.

```
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   ...
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto Src0Parts = B.buildUnmerge(S32, Src0);
```

vikramRH wrote:

Done, I hope as per the expectation; however, I don't understand the plus here.
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto Src0Parts = B.buildUnmerge(S32, Src0);

vikramRH wrote: Do you mean extract the s16 elements individually and handle them as the (Size < 32) case? https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {

vikramRH wrote: My bad, I will improve the helper. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -2176,26 +2176,23 @@ def int_amdgcn_wave_reduce_umin : AMDGPUWaveReduce;
 def int_amdgcn_wave_reduce_umax : AMDGPUWaveReduce;
 
 def int_amdgcn_readfirstlane :
-  ClangBuiltin<"__builtin_amdgcn_readfirstlane">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty],
+  Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
             [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // The lane argument must be uniform across the currently active threads of the
 // current wave. Otherwise, the result is undefined.
 def int_amdgcn_readlane :
-  ClangBuiltin<"__builtin_amdgcn_readlane">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
+  Intrinsic<[llvm_any_ty], [LLVMMatchType<0>, llvm_i32_ty],
             [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // The value to write and lane select arguments must be uniform across the
 // currently active threads of the current wave. Otherwise, the result is
 // undefined.
 def int_amdgcn_writelane :
-  ClangBuiltin<"__builtin_amdgcn_writelane">,
-  Intrinsic<[llvm_i32_ty], [
-    llvm_i32_ty,    // uniform value to write: returned by the selected lane
-    llvm_i32_ty,    // uniform lane select
-    llvm_i32_ty     // returned by all lanes other than the selected one
+  Intrinsic<[llvm_any_ty], [
+    LLVMMatchType<0>,    // uniform value to write: returned by the selected lane
+    llvm_i32_ty,    // uniform lane select
+    LLVMMatchType<0>     // returned by all lanes other than the selected one

arsenm wrote: Comments are no longer aligned. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if ((Size % 32) == 0) {
+    SmallVector<Register, 2> PartialRes;
+    unsigned NumParts = Size / 32;
+    auto Src0Parts = B.buildUnmerge(S32, Src0);

arsenm wrote: For the multiple of `<2 x s16>` case, it's a bit nicer to preserve the 16-bit element types. https://github.com/llvm/llvm-project/pull/89217
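For context, a minimal sketch of what preserving the element type could look like here, assuming the surrounding legalizeLaneOp variables (B, Ty, Src0, S32) from the patch; the piece-type selection below is illustrative, not the PR's final code:

```cpp
// Illustrative only: unmerge <N x s16> sources into <2 x s16> pieces so the
// 16-bit element type survives, instead of flattening to raw s32 pieces.
LLT PartTy = (Ty.isVector() && Ty.getElementType() == LLT::scalar(16))
                 ? LLT::fixed_vector(2, 16) // keep 16-bit lanes paired
                 : LLT::scalar(32);         // plain 32-bit pieces otherwise
auto Src0Parts = B.buildUnmerge(PartTy, Src0);
// Each piece is still 32 bits wide, so it can be bitcast to s32 for the
// lane intrinsic and bitcast back when rebuilding the result.
```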
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+    if (Ty.isPointerVector()) {
+      auto MergedVec = B.buildMergeLikeInstr(
+          LLT::vector(ElementCount::getFixed(NumParts), S32), PartialRes);
+      B.buildBitcast(DstReg, MergedVec);

arsenm wrote: You cannot bitcast from an i32 vector to a pointer. You can merge the i32 pieces directly into the pointer, though. https://github.com/llvm/llvm-project/pull/89217
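A sketch of the suggested direction, hypothetical code assuming the same PartialRes pieces and NumParts from the patch; the pieces are merged straight into pointer-typed values rather than bitcast:

```cpp
// Illustrative only: G_MERGE_VALUES can build a pointer from s32 pieces
// directly, so no i32-vector-to-pointer bitcast is needed.
if (Ty.isPointer()) {
  B.buildMergeLikeInstr(DstReg, PartialRes);
} else if (Ty.isPointerVector()) {
  LLT EltTy = Ty.getElementType();
  unsigned PiecesPerElt = EltTy.getSizeInBits() / 32;
  SmallVector<Register, 4> PtrElts;
  ArrayRef<Register> Pieces(PartialRes);
  for (unsigned I = 0, E = Ty.getNumElements(); I != E; ++I)
    PtrElts.push_back(
        B.buildMergeLikeInstr(EltTy,
                              Pieces.slice(I * PiecesPerElt, PiecesPerElt))
            .getReg(0));
  // Assemble the vector of pointers from the per-element merges.
  // (32-bit pointer elements would instead need a G_INTTOPTR per piece.)
  B.buildBuildVector(DstReg, PtrElts);
}
```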
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+    case Intrinsic::amdgcn_writelane: {
+      Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+      LaneOpDst = B.buildIntrinsic(IID, {S32})
+                      .addUse(Src0Valid)
+                      .addUse(Src1)
+                      .addUse(Src2Valid);

arsenm wrote: Missing break. https://github.com/llvm/llvm-project/pull/89217
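The fix is simply to terminate the case; a sketch of the corrected fragment:

```cpp
case Intrinsic::amdgcn_writelane: {
  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
  LaneOpDst = B.buildIntrinsic(IID, {S32})
                  .addUse(Src0Valid)
                  .addUse(Src1)
                  .addUse(Src2Valid);
  break; // was missing in the patch
}
```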
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {

arsenm wrote: This isn't quite what I meant; I meant that trying to handle these all as if they were the same in createLaneOp was ugly. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -493,8 +493,8 @@ Value *AMDGPUAtomicOptimizerImpl::buildScan(IRBuilder<> &B,
   if (!ST->isWave32()) {
     // Combine lane 31 into lanes 32..63.
     V = B.CreateBitCast(V, IntNTy);
-    Value *const Lane31 = B.CreateIntrinsic(Intrinsic::amdgcn_readlane, {},
-                                            {V, B.getInt32(31)});
+    Value *const Lane31 = B.CreateIntrinsic(
+        Intrinsic::amdgcn_readlane, B.getInt32Ty(), {V, B.getInt32(31)});

jayfoad wrote: Changes like this should disappear if we merge #91583 first. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {

vikramRH wrote: I removed the helper in the recent commit following @arsenm's suggestion. The only reason is readability. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
vikramRH wrote: > > add f32 pattern to select read/writelane operations > > Why would you need this? Don't you legalize f32 to i32? Sorry about this. It's a leftover comment from the initial implementation which I should have removed. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);

jayfoad wrote: I think you can just reuse Src0 instead of declaring a new Src0Valid. Same for Src2, and same for the SDAG code. https://github.com/llvm/llvm-project/pull/89217
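A sketch of that simplification, under the same surrounding definitions; the bitcast result just overwrites the incoming register variable:

```cpp
// Illustrative only: reuse the existing variables instead of introducing
// parallel *Valid copies of each source register.
Src0 = B.buildBitcast(S32, Src0).getReg(0);
if (Src2.isValid())
  Src2 = B.buildBitcast(S32, Src2).getReg(0);
```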
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+    case Intrinsic::amdgcn_readlane: {
+      Register Src1 = MI.getOperand(3).getReg();
+      for (unsigned i = 0; i < NumParts; ++i)
+        PartialRes.push_back(
+            B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+                .addUse(Src0Parts.getReg(i))
+                .addUse(Src1)
+                .getReg(0));

jayfoad wrote: Yes, separate patch. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  if (Size == 32) {
+    if (Ty.isScalar())
+      // Already legal
+      return true;
+
+    Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+    MachineInstrBuilder LaneOpDst;
+    switch (IID) {

jayfoad wrote: Can you use a `createLaneOp` helper to build the intrinsic, like you do in the SDAG path? https://github.com/llvm/llvm-project/pull/89217
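A minimal sketch of what such a GISel-side helper could look like, mirroring the SDAG lambda; the name and the by-value Register parameters are assumptions here, not the PR's final code:

```cpp
// Illustrative only: emit the right lane intrinsic for whichever operands
// are present, so each size case avoids repeating the three-way switch.
auto createLaneOp = [&](Register Src0, Register Src1,
                        Register Src2) -> Register {
  auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
  if (Src1.isValid()) // readlane/writelane carry a lane-select operand
    LaneOp.addUse(Src1);
  if (Src2.isValid()) // writelane also carries the "other lanes" value
    LaneOp.addUse(Src2);
  return LaneOp.getReg(0);
};
```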
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
https://github.com/jayfoad commented: LGTM overall. > add f32 pattern to select read/writelane operations Why would you need this? Don't you legalize f32 to i32? https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+    case Intrinsic::amdgcn_readlane: {
+      Register Src1 = MI.getOperand(3).getReg();
+      for (unsigned i = 0; i < NumParts; ++i)
+        PartialRes.push_back(
+            B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+                .addUse(Src0Parts.getReg(i))
+                .addUse(Src1)
+                .getReg(0));

vikramRH wrote: Should this be a separate change that addresses other such instances too? https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
github-actions[bot] wrote: :warning: C/C++ code formatter, clang-format found issues in your code. :warning: You can test this locally with the following command:

```bash
git-clang-format --diff 3d56ea05b6c746a7144f643bef2ebd599f605b8b d0610c47c0e2cb4bca5f90c289ffaa5c4178547f -- clang/lib/CodeGen/CGBuiltin.cpp llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h llvm/lib/Target/AMDGPU/SIISelLowering.cpp
```

View the diff from clang-format here.

```diff
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 16d3219a16..b0a2bbeb61 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -5415,18 +5415,21 @@ bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
     Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
     MachineInstrBuilder LaneOpDst;
     switch (IID) {
-  case Intrinsic::amdgcn_readfirstlane: {
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
-    break;
-  }
-  case Intrinsic::amdgcn_readlane: {
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
-    break;
-  }
-  case Intrinsic::amdgcn_writelane: {
-    Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1).addUse(Src2Valid);
-  }
+    case Intrinsic::amdgcn_readfirstlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+      break;
+    }
+    case Intrinsic::amdgcn_readlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+      break;
+    }
+    case Intrinsic::amdgcn_writelane: {
+      Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+      LaneOpDst = B.buildIntrinsic(IID, {S32})
+                      .addUse(Src0Valid)
+                      .addUse(Src1)
+                      .addUse(Src2Valid);
+    }
     }
 
     Register LaneOpDstReg = LaneOpDst.getReg(0);
@@ -5443,21 +5446,25 @@ bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
 
     MachineInstrBuilder LaneOpDst;
     switch (IID) {
-  case Intrinsic::amdgcn_readfirstlane: {
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
-    break;
-  }
-  case Intrinsic::amdgcn_readlane: {
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
-    break;
-  }
-  case Intrinsic::amdgcn_writelane: {
-    Register Src2Cast = MRI.getType(Src2).isScalar()
-                ? Src2
-                : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
-    Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
-    LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1).addUse(Src2Valid);
-  }
+    case Intrinsic::amdgcn_readfirstlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+      break;
+    }
+    case Intrinsic::amdgcn_readlane: {
+      LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+      break;
+    }
+    case Intrinsic::amdgcn_writelane: {
+      Register Src2Cast =
+          MRI.getType(Src2).isScalar()
+              ? Src2
+              : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+      Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+      LaneOpDst = B.buildIntrinsic(IID, {S32})
+                      .addUse(Src0Valid)
+                      .addUse(Src1)
+                      .addUse(Src2Valid);
+    }
     }
 
     Register LaneOpDstReg = LaneOpDst.getReg(0);
```

https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -504,3 +508,16 @@ def AMDGPUdiv_fmas : PatFrags<(ops node:$src0, node:$src1, node:$src2, node:$vcc),
 def AMDGPUperm : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_perm node:$src0, node:$src1, node:$src2),
    (AMDGPUperm_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUreadlane : PatFrags<(ops node:$src0, node:$src1),
+  [(int_amdgcn_readlane node:$src0, node:$src1),
+   (AMDGPUreadlane_impl node:$src0, node:$src1)]>;
+
+def AMDGPUreadfirstlane : PatFrags<(ops node:$src),
+  [(int_amdgcn_readfirstlane node:$src),
+   (AMDGPUreadfirstlane_impl node:$src)]>;
+
+def AMDGPUwritelane : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_writelane node:$src0, node:$src1, node:$src2),
+   (AMDGPUwritelane_impl node:$src0, node:$src1, node:$src2)]>;
+

arsenm wrote: Missing newline at end of file. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  auto createLaneOp = [&](Register &Src0, Register &Src1,
+                          Register &Src2) -> Register {

arsenm wrote: I think this helper is just making things more confusing. You can just handle the 3 cases separately with unmerge logic. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+    case Intrinsic::amdgcn_readlane: {
+      Register Src1 = MI.getOperand(3).getReg();
+      for (unsigned i = 0; i < NumParts; ++i)
+        PartialRes.push_back(
+            B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+                .addUse(Src0Parts.getReg(i))
+                .addUse(Src1)
+                .getReg(0));

arsenm wrote: We should really add a buildIntrinsic overload that just takes the array of inputs, like for other instructions. https://github.com/llvm/llvm-project/pull/89217
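For illustration, the call-site shape such an overload would enable; the overload itself is hypothetical, since no MachineIRBuilder::buildIntrinsic variant taking the inputs as an array existed at the time of this review:

```cpp
// Today: chain each input with addUse.
PartialRes.push_back(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
                         .addUse(Src0Parts.getReg(i))
                         .addUse(Src1)
                         .getReg(0));

// Hypothetical overload taking the inputs as an array, as suggested:
// PartialRes.push_back(
//     B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32},
//                      {Src0Parts.getReg(i), Src1})
//         .getReg(0));
```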
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5982,6 +5982,68 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
                      DAG.getConstant(0, SL, MVT::i32),
                      DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&](SDValue &Src0, SDValue &Src1, SDValue &Src2,

arsenm wrote: SDValue should be passed by value. https://github.com/llvm/llvm-project/pull/89217
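Restating the lambda header with by-value parameters (SDValue is a small pointer-plus-index handle, so copying it is the LLVM convention); the trailing MVT parameter is assumed from the patch and the body is elided here:

```cpp
// By-value SDValue parameters, per the review comment:
auto createLaneOp = [&](SDValue Src0, SDValue Src1, SDValue Src2,
                        MVT VT) -> SDValue {
  // ... build READFIRSTLANE / READLANE / WRITELANE as before ...
  return SDValue();
};
```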
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  auto createLaneOp = [&](Register &Src0, Register &Src1,
+                          Register &Src2) -> Register {

arsenm wrote: Register should be passed by value. https://github.com/llvm/llvm-project/pull/89217
[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)
@@ -5386,6 +5386,130 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
[...]
+  auto createLaneOp = [&](Register &Src0, Register &Src1,
+                          Register &Src2) -> Register {
+    auto LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+    if (Src2.isValid())
+      return (LaneOpDst.addUse(Src1).addUse(Src2)).getReg(0);
+    if (Src1.isValid())
+      return (LaneOpDst.addUse(Src1)).getReg(0);

arsenm wrote: Extra parentheses around this. https://github.com/llvm/llvm-project/pull/89217
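The same returns without the redundant parentheses:

```cpp
if (Src2.isValid())
  return LaneOpDst.addUse(Src1).addUse(Src2).getReg(0);
if (Src1.isValid())
  return LaneOpDst.addUse(Src1).getReg(0);
return LaneOpDst.getReg(0);
```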