[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm commented:

Should lose the [WIP] in the title 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Vikram Hegde via cfe-commits

vikramRH wrote:

> > 1. What's the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> fptrunc" is wrong.)
> 
> Bitcast to i16, anyext to i32, laneop, trunc to i16, bitcast to original type.
> 
> Why wouldn't you use bitcasts?

Just a doubt I had based on the previous comments, sorry for the noise!

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Matt Arsenault via cfe-commits


@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&DAG, &SL](SDValue Src0, SDValue Src1, SDValue Src2,
+                                  MVT VT) -> SDValue {
+return (Src2 ? DAG.getNode(AMDGPUISD::WRITELANE, SL, VT, {Src0, Src1, 
Src2})
+: Src1 ? DAG.getNode(AMDGPUISD::READLANE, SL, VT, {Src0, Src1})
+   : DAG.getNode(AMDGPUISD::READFIRSTLANE, SL, VT, {Src0}));
+  };
+
+  SDValue Src1, Src2;
+  if (IntrinsicID == Intrinsic::amdgcn_readlane ||
+  IntrinsicID == Intrinsic::amdgcn_writelane) {
+Src1 = N->getOperand(2);
+if (IntrinsicID == Intrinsic::amdgcn_writelane)
+  Src2 = N->getOperand(3);
+  }
+
+  if (ValSize == 32) {
+// Already legal
+return SDValue();
+  }
+
+  if (ValSize < 32) {
+SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+if (Src2.getNode()) {
+  SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);

arsenm wrote:

Yes, bitcast for the f16/bf16 case to get to the int 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Matt Arsenault via cfe-commits


@@ -5456,43 +5444,32 @@ bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
   if ((Size % 32) == 0) {
     SmallVector<Register, 2> PartialRes;
     unsigned NumParts = Size / 32;
-    auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+    bool IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;

arsenm wrote:

Better to track this as the LLT to use for the pieces, rather than making it 
this conditional thing. This will simplify improved pointer handling in the 
future 
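A minimal sketch of that suggestion (assuming a local named PartialResTy; S32/V2S16 are the LLT constants already used in this function):

```
LLT PartialResTy = S32;
if (Ty.isVector() && Ty.getElementType() == S16)
  PartialResTy = V2S16;
// Unmerge into pieces of the chosen type; each per-piece lane op can then be
// built with {PartialResTy} directly, so no per-piece bitcasts are needed.
auto Src0Parts = B.buildUnmerge(PartialResTy, Src0);
```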

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Jay Foad via cfe-commits

jayfoad wrote:

> 1. What's the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> fptrunc" is wrong.)

Bitcast to i16, anyext to i32, laneop, trunc to i16, bitcast to original type.

Why wouldn't you use bitcasts?
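
As a rough illustration, that sequence in SelectionDAG terms would look something like the sketch below (reusing the patch's createLaneOp lambda and the Src0/Src1/Src2/VT values from lowerLaneOp; this is only a sketch of the suggestion, not the exact patch code):

```
// f16/bf16 source: reinterpret as i16, widen for the 32-bit lane op, then
// narrow and reinterpret back. No fp_extend/fptrunc is involved.
SDValue Cast  = DAG.getBitcast(MVT::i16, Src0);                   // f16/bf16 -> i16
SDValue Ext   = DAG.getNode(ISD::ANY_EXTEND, SL, MVT::i32, Cast); // i16 -> i32
SDValue Lane  = createLaneOp(Ext, Src1, Src2, MVT::i32);          // 32-bit lane op
SDValue Trunc = DAG.getNode(ISD::TRUNCATE, SL, MVT::i16, Lane);   // i32 -> i16
SDValue Res   = DAG.getBitcast(VT, Trunc);                        // i16 -> original type
// For writelane, the Src2 data operand would need the same bitcast + anyext treatment.
```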

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Vikram Hegde via cfe-commits

vikramRH wrote:

Updated the GlobalISel legalizer. I still have a couple of questions for the SDAG case though:
1. What's the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I would think "fp_extend -> LaneOp -> fptrunc" is wrong.)
2. For scalar cases such as i64, f64, i128, etc. (i.e. 32-bit multiples), I guess a bitcast to vectors (v2i32, v2f32, v4i32) is unavoidable since "UnrollVectorOp" wouldn't work otherwise. Any alternate suggestions here?

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Vikram Hegde via cfe-commits


@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+// Already legal
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+
+Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDst);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+MachineInstrBuilder Src0Parts;
+
+if (Ty.isPointer()) {
+  auto PtrToInt = B.buildPtrToInt(LLT::scalar(Size), Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else if (Ty.isPointerVector()) {
+  LLT IntVecTy = Ty.changeElementType(
+  LLT::scalar(Ty.getElementType().getSizeInBits()));
+  auto PtrToInt = B.buildPtrToInt(IntVecTy, Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else
+  Src0Parts =
+  IsS16Vec ? B.buildUnmerge(V2S16, Src0) : B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0)
+ .addUse(Src1))
+.getReg(0));
+  }
+  break;
+}
+case Intrinsic::amdgcn_readfirstlane: {
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
+ .addUse(Src0)
+ .getReg(0)));
+  }
+
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  Register Src2 = MI.getOperand(4).getReg();
+  MachineInstrBuilder Src2Parts;
+
+  if (Ty.isPointer()) {
+auto PtrToInt = B.buildPtrToInt(S64, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else if (Ty.isPointerVector()) {
+LLT IntVecTy = Ty.changeElementType(
+LLT::scalar(Ty.getElementType().getSizeInBits()));
+auto PtrToInt = B.buildPtrToInt(IntVecTy, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else
+Src2Parts =
+IsS16Vec ? B.buildUnmerge(V2S16, Src2) : B.buildUnmerge(S32, Src2);

vikramRH wrote:

done

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-23 Thread Vikram Hegde via cfe-commits


@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&DAG, &SL](SDValue Src0, SDValue Src1, SDValue Src2,
+                                  MVT VT) -> SDValue {
+return (Src2 ? DAG.getNode(AMDGPUISD::WRITELANE, SL, VT, {Src0, Src1, 
Src2})
+: Src1 ? DAG.getNode(AMDGPUISD::READLANE, SL, VT, {Src0, Src1})
+   : DAG.getNode(AMDGPUISD::READFIRSTLANE, SL, VT, {Src0}));
+  };
+
+  SDValue Src1, Src2;
+  if (IntrinsicID == Intrinsic::amdgcn_readlane ||
+  IntrinsicID == Intrinsic::amdgcn_writelane) {
+Src1 = N->getOperand(2);
+if (IntrinsicID == Intrinsic::amdgcn_writelane)
+  Src2 = N->getOperand(3);
+  }
+
+  if (ValSize == 32) {
+// Already legal
+return SDValue();
+  }
+
+  if (ValSize < 32) {
+SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+if (Src2.getNode()) {
+  SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);

vikramRH wrote:

What would be the proper way to legalize f16 and bf16 for the SDAG case without bitcasts? (I'm currently thinking "fp_extend -> LaneOp -> fptrunc", which seems wrong.)

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+// Already legal
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+
+Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDst);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;

arsenm wrote:

no auto 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+// Already legal
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+
+Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDst);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+MachineInstrBuilder Src0Parts;
+
+if (Ty.isPointer()) {
+  auto PtrToInt = B.buildPtrToInt(LLT::scalar(Size), Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else if (Ty.isPointerVector()) {
+  LLT IntVecTy = Ty.changeElementType(
+  LLT::scalar(Ty.getElementType().getSizeInBits()));
+  auto PtrToInt = B.buildPtrToInt(IntVecTy, Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else
+  Src0Parts =
+  IsS16Vec ? B.buildUnmerge(V2S16, Src0) : B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0)
+ .addUse(Src1))
+.getReg(0));
+  }
+  break;
+}
+case Intrinsic::amdgcn_readfirstlane: {
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
+ .addUse(Src0)
+ .getReg(0)));
+  }
+
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  Register Src2 = MI.getOperand(4).getReg();
+  MachineInstrBuilder Src2Parts;
+
+  if (Ty.isPointer()) {
+auto PtrToInt = B.buildPtrToInt(S64, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else if (Ty.isPointerVector()) {
+LLT IntVecTy = Ty.changeElementType(
+LLT::scalar(Ty.getElementType().getSizeInBits()));
+auto PtrToInt = B.buildPtrToInt(IntVecTy, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else
+Src2Parts =
+IsS16Vec ? B.buildUnmerge(V2S16, Src2) : B.buildUnmerge(S32, Src2);

arsenm wrote:

The point of splitting out the pointer-typed tests was to avoid having to fix the handling of the pointer-typed selection patterns. You still have the casts inserted here. You should not need any ptrtoint, inttoptr, or bitcasts in any of these legalizations.

https://github.com/llvm/llvm-project/pull/89217

[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm requested changes to this pull request.

There should be no need to introduce same-sized value casts, whether bitcast or 
ptrtoint in either legalizer 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits


@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&DAG, &SL](SDValue Src0, SDValue Src1, SDValue Src2,
+                                  MVT VT) -> SDValue {
+return (Src2 ? DAG.getNode(AMDGPUISD::WRITELANE, SL, VT, {Src0, Src1, 
Src2})
+: Src1 ? DAG.getNode(AMDGPUISD::READLANE, SL, VT, {Src0, Src1})
+   : DAG.getNode(AMDGPUISD::READFIRSTLANE, SL, VT, {Src0}));
+  };
+
+  SDValue Src1, Src2;
+  if (IntrinsicID == Intrinsic::amdgcn_readlane ||
+  IntrinsicID == Intrinsic::amdgcn_writelane) {
+Src1 = N->getOperand(2);
+if (IntrinsicID == Intrinsic::amdgcn_writelane)
+  Src2 = N->getOperand(3);
+  }
+
+  if (ValSize == 32) {
+// Already legal
+return SDValue();
+  }
+
+  if (ValSize < 32) {
+SDValue InitBitCast = DAG.getBitcast(IntVT, Src0);
+Src0 = DAG.getAnyExtOrTrunc(InitBitCast, SL, MVT::i32);
+if (Src2.getNode()) {
+  SDValue Src2Cast = DAG.getBitcast(IntVT, Src2);

arsenm wrote:

You should not have any bitcasts anywhere 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+// Already legal
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+
+Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDst);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+MachineInstrBuilder Src0Parts;
+
+if (Ty.isPointer()) {
+  auto PtrToInt = B.buildPtrToInt(LLT::scalar(Size), Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else if (Ty.isPointerVector()) {
+  LLT IntVecTy = Ty.changeElementType(
+  LLT::scalar(Ty.getElementType().getSizeInBits()));
+  auto PtrToInt = B.buildPtrToInt(IntVecTy, Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else
+  Src0Parts =
+  IsS16Vec ? B.buildUnmerge(V2S16, Src0) : B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0)
+ .addUse(Src1))
+.getReg(0));
+  }
+  break;
+}
+case Intrinsic::amdgcn_readfirstlane: {
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
+ .addUse(Src0)
+ .getReg(0)));
+  }
+
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  Register Src2 = MI.getOperand(4).getReg();
+  MachineInstrBuilder Src2Parts;
+
+  if (Ty.isPointer()) {
+auto PtrToInt = B.buildPtrToInt(S64, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else if (Ty.isPointerVector()) {
+LLT IntVecTy = Ty.changeElementType(
+LLT::scalar(Ty.getElementType().getSizeInBits()));
+auto PtrToInt = B.buildPtrToInt(IntVecTy, Src2);
+Src2Parts = B.buildUnmerge(S32, PtrToInt);
+  } else
+Src2Parts =
+IsS16Vec ? B.buildUnmerge(V2S16, Src2) : B.buildUnmerge(S32, Src2);
+
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+Src2 = IsS16Vec ? B.buildBitcast(S32, Src2Parts.getReg(i)).getReg(0)
+: Src2Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_writelane, {S32})
+ .addUse(Src0)
+ 

[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,192 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+// Already legal
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0 = B.buildAnyExt(S32, Src0Cast).getReg(0);
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2 = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+
+Register LaneOpDst = createLaneOp(Src0, Src1, Src2);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDst);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDst);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto IsS16Vec = Ty.isVector() && Ty.getElementType() == S16;
+MachineInstrBuilder Src0Parts;
+
+if (Ty.isPointer()) {
+  auto PtrToInt = B.buildPtrToInt(LLT::scalar(Size), Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else if (Ty.isPointerVector()) {
+  LLT IntVecTy = Ty.changeElementType(
+  LLT::scalar(Ty.getElementType().getSizeInBits()));
+  auto PtrToInt = B.buildPtrToInt(IntVecTy, Src0);
+  Src0Parts = B.buildUnmerge(S32, PtrToInt);
+} else
+  Src0Parts =
+  IsS16Vec ? B.buildUnmerge(V2S16, Src0) : B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)
+: Src0Parts.getReg(i);
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0)
+ .addUse(Src1))
+.getReg(0));
+  }
+  break;
+}
+case Intrinsic::amdgcn_readfirstlane: {
+  for (unsigned i = 0; i < NumParts; ++i) {
+Src0 = IsS16Vec ? B.buildBitcast(S32, Src0Parts.getReg(i)).getReg(0)

arsenm wrote:

No bitcasts 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-20 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm edited https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-18 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-18 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-18 Thread Vikram Hegde via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

vikramRH wrote:

Done

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-17 Thread Matt Arsenault via cfe-commits


@@ -6086,6 +6086,62 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&](SDValue Src0, SDValue Src1, SDValue Src2,
+  MVT VT) -> SDValue {

arsenm wrote:

? VT shadow 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-17 Thread Matt Arsenault via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

arsenm wrote:

I'd rather just leave the pointer cases failing in the global isel case, and 
remove all the cast insertion bits.
You could split out the pointer cases to a separate file and just not run it 
with globalisel. 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

vikramRH wrote:

Do you think these changes are okay until I figure out the root cause?

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Matt Arsenault via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

arsenm wrote:

I think GlobalISelEmitter is just ignoring PtrValueType

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Matt Arsenault via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

arsenm wrote:

Seems like this is just a GlobalISelEmitter bug (and another example of why we should have a way to write patterns that only care about the bit size).

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits


@@ -342,6 +342,22 @@ def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
 
 def AMDGPUperm_impl : SDNode<"AMDGPUISD::PERM", AMDGPUDTIntTernaryOp, []>;
 
+def AMDGPUReadfirstlaneOp : SDTypeProfile<1, 1, [
+  SDTCisSameAs<0, 1>
+]>;
+
+def AMDGPUReadlaneOp : SDTypeProfile<1, 2, [
+  SDTCisSameAs<0, 1>, SDTCisInt<2>
+]>;
+
+def AMDGPUDWritelaneOp : SDTypeProfile<1, 3, [
+  SDTCisSameAs<1, 1>, SDTCisInt<2>, SDTCisSameAs<0, 3>,

vikramRH wrote:

Thanks for pointing this out, I missed updating this in the latest version. It is updated now; however, the issue is not related to this.

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Diana Picus via cfe-commits


@@ -342,6 +342,22 @@ def AMDGPUfdot2_impl : SDNode<"AMDGPUISD::FDOT2",
 
 def AMDGPUperm_impl : SDNode<"AMDGPUISD::PERM", AMDGPUDTIntTernaryOp, []>;
 
+def AMDGPUReadfirstlaneOp : SDTypeProfile<1, 1, [
+  SDTCisSameAs<0, 1>
+]>;
+
+def AMDGPUReadlaneOp : SDTypeProfile<1, 2, [
+  SDTCisSameAs<0, 1>, SDTCisInt<2>
+]>;
+
+def AMDGPUDWritelaneOp : SDTypeProfile<1, 3, [
+  SDTCisSameAs<1, 1>, SDTCisInt<2>, SDTCisSameAs<0, 3>,

rovka wrote:

<0,1>

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

vikramRH wrote:

Attaching example match table snippets for v2i16 and p3 here; this should make the scenario a bit clearer.
for v2i16
 ```
GIM_Try, /*On fail goto*//*Label 3499*/ GIMT_Encode4(202699), // Rule ID 2117 //
GIM_CheckIntrinsicID, /*MI*/0, /*Op*/1, 
GIMT_Encode2(Intrinsic::amdgcn_writelane),
GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_v2s16,
GIM_RootCheckType, /*Op*/2, /*Type*/GILLT_v2s16,
GIM_RootCheckType, /*Op*/3, /*Type*/GILLT_s32,
GIM_RootCheckType, /*Op*/4, /*Type*/GILLT_v2s16,
GIM_RootCheckRegBankForClass, /*Op*/0, 
/*RC*/GIMT_Encode2(AMDGPU::VGPR_32RegClassID),
// (intrinsic_wo_chain:{ *:[v2i16] } 2863:{ *:[iPTR] }, v2i16:{ 
*:[v2i16] }:$src0, i32:{ *:[i32] }:$src1, v2i16:{ *:[v2i16] }:$src2)  =>  
(V_WRITELANE_B32:{ *:[v2i16] } SCSrc_b32:{ *:[v2i16] }:$src0, SCSrc_b32:{ 
*:[i32] }:$src1, VGPR_32:{ *:[v2i16] }:$src2)
GIR_BuildRootMI, /*Opcode*/GIMT_Encode2(AMDGPU::V_WRITELANE_B32),
```

and for p3,
```
GIM_Try, /*On fail goto*//*Label 3502*/ GIMT_Encode4(202816), // Rule ID 2129 //
GIM_CheckIntrinsicID, /*MI*/0, /*Op*/1, 
GIMT_Encode2(Intrinsic::amdgcn_writelane),
GIM_RootCheckType, /*Op*/0, /*Type*/GILLT_s32,
GIM_RootCheckType, /*Op*/2, /*Type*/GILLT_p2s32,
GIM_RootCheckType, /*Op*/3, /*Type*/GILLT_s32,
GIM_RootCheckType, /*Op*/4, /*Type*/GILLT_p2s32,
GIM_RootCheckRegBankForClass, /*Op*/0, 
/*RC*/GIMT_Encode2(AMDGPU::VGPR_32RegClassID),
// (intrinsic_wo_chain:{ *:[i32] } 2863:{ *:[iPTR] }, p2:{ *:[i32] 
}:$src0, i32:{ *:[i32] }:$src1, p2:{ *:[i32] }:$src2)  =>  (V_WRITELANE_B32:{ 
*:[i32] } SCSrc_b32:{ *:[i32] }:$src0, SCSrc_b32:{ *:[i32] }:$src1, VGPR_32:{ 
*:[i32] }:$src2)
GIR_BuildRootMI, /*Opcode*/GIMT_Encode2(AMDGPU::V_WRITELANE_B32),
```

The destination type check for the p3 case is still "GILLT_s32".



https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Matt Arsenault via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

arsenm wrote:

What is the problem? 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Vikram Hegde via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

vikramRH wrote:

Unfortunately no, I had tried this and a couple of other variations. The issue seems to be specific to GlobalISel pointers.

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Matt Arsenault via cfe-commits


@@ -780,14 +780,22 @@ defm V_SUBREV_U32 : VOP2Inst <"v_subrev_u32", 
VOP_I32_I32_I32_ARITH, null_frag,
 
 // These are special and do not read the exec mask.
 let isConvergent = 1, Uses = [] in {
-def V_READLANE_B32 : VOP2_Pseudo<"v_readlane_b32", VOP_READLANE,
-  [(set i32:$vdst, (int_amdgcn_readlane i32:$src0, i32:$src1))]>;
+def V_READLANE_B32 : VOP2_Pseudo<"v_readlane_b32", VOP_READLANE,[]>;
 let IsNeverUniform = 1, Constraints = "$vdst = $vdst_in", 
DisableEncoding="$vdst_in" in {
-def V_WRITELANE_B32 : VOP2_Pseudo<"v_writelane_b32", VOP_WRITELANE,
-  [(set i32:$vdst, (int_amdgcn_writelane i32:$src0, i32:$src1, 
i32:$vdst_in))]>;
+def V_WRITELANE_B32 : VOP2_Pseudo<"v_writelane_b32", VOP_WRITELANE, []>;
 } // End IsNeverUniform, $vdst = $vdst_in, DisableEncoding $vdst_in
 } // End isConvergent = 1
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadlane vt:$src0, i32:$src1)),
+(V_READLANE_B32 VRegOrLdsSrc_32:$src0, SCSrc_b32:$src1)

arsenm wrote:

Same here, supply the type to the result instruction 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-16 Thread Matt Arsenault via cfe-commits


@@ -243,11 +243,16 @@ def VOP_READFIRSTLANE : VOPProfile <[i32, i32, untyped, 
untyped]> {
 // FIXME: Specify SchedRW for READFIRSTLANE_B32
 // TODO: There is VOP3 encoding also
 def V_READFIRSTLANE_B32 : VOP1_Pseudo <"v_readfirstlane_b32", 
VOP_READFIRSTLANE,
-   getVOP1Pat.ret, 1> {
+   [], 1> {
   let isConvergent = 1;
 }
 
+foreach vt = Reg32Types.types in {
+  def : GCNPat<(vt (AMDGPUreadfirstlane (vt VRegOrLdsSrc_32:$src0))),
+(V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0))

arsenm wrote:

```suggestion
(vt (V_READFIRSTLANE_B32 (vt VRegOrLdsSrc_32:$src0)))
```

This may fix your pattern type deduction issue 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-15 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-15 Thread Vikram Hegde via cfe-commits


@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal

vikramRH wrote:

Also, the issue is only for pointer types; float, v2i16, etc. work just fine.

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-15 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-15 Thread Vikram Hegde via cfe-commits


@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal

vikramRH wrote:

Done, except for pointers. I currently see an issue where pattern type inference somehow deduces the destination type to be a scalar (instead of, say, LLT_p3s32). Not sure why at the moment, any ideas?

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Matt Arsenault via cfe-commits


@@ -6086,6 +6086,68 @@ static SDValue lowerBALLOTIntrinsic(const SITargetLowering &TLI, SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering &TLI, SDNode *N,
+                           SelectionDAG &DAG) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&](SDValue Src0, SDValue Src1, SDValue Src2,
+  MVT VT) -> SDValue {
+return (Src2 ? DAG.getNode(AMDGPUISD::WRITELANE, SL, VT, {Src0, Src1, 
Src2})
+: Src1 ? DAG.getNode(AMDGPUISD::READLANE, SL, VT, {Src0, Src1})
+   : DAG.getNode(AMDGPUISD::READFIRSTLANE, SL, VT, {Src0}));
+  };
+
+  SDValue Src1, Src2;
+  if (IntrinsicID == Intrinsic::amdgcn_readlane ||
+  IntrinsicID == Intrinsic::amdgcn_writelane) {
+Src1 = N->getOperand(2);
+if (IntrinsicID == Intrinsic::amdgcn_writelane)
+  Src2 = N->getOperand(3);
+  }
+
+  if (ValSize == 32) {
+if (VT == MVT::i32)
+  // Already legal
+  return SDValue();
+Src0 = DAG.getBitcast(IntVT, Src0);

arsenm wrote:

Like the other cases, we should be able to avoid intermediate casting 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Matt Arsenault via cfe-commits


@@ -3400,7 +3400,7 @@ def : GCNPat<
 // FIXME: Should also do this for readlane, but tablegen crashes on
 // the ignored src1.
 def : GCNPat<
-  (int_amdgcn_readfirstlane (i32 imm:$src)),
+  (i32 (AMDGPUreadfirstlane (i32 imm:$src))),

arsenm wrote:

We might need to make this fold more sophisticated for other types, but that's best left for a follow-up patch.

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal

arsenm wrote:

Either add braces or don't put the comment under the if. Also, the size 32 case 
should be treated as directly legal for 32-bit pointers 
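A sketch of what both points could look like together (illustrative only, not the final patch code):

```
  if (Size == 32) {
    // Already legal; this covers 32-bit pointers (e.g. p3, p5) as well as
    // s32 and v2s16.
    return true;
  }
```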

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Matt Arsenault via cfe-commits


@@ -5387,6 +5387,212 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register Src0, Register Src1,
+  Register Src2) -> Register {
+auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane:
+  return LaneOp.getReg(0);
+case Intrinsic::amdgcn_readlane:
+  return LaneOp.addUse(Src1).getReg(0);
+case Intrinsic::amdgcn_writelane:
+  return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
+default:
+  llvm_unreachable("unhandled lane op");
+}
+  };
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+auto IsPtr = Ty.isPointer();
+Src0 = IsPtr ? B.buildPtrToInt(S32, Src0).getReg(0)
+ : B.buildBitcast(S32, Src0).getReg(0);

arsenm wrote:

You should not need these casts. Any legal type should be directly accepted 
without these 

https://github.com/llvm/llvm-project/pull/89217


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,153 @@ bool AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper &Helper,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper &Helper,
+                                         MachineInstr &MI,
+                                         Intrinsic::ID IID) const {
+
+  MachineIRBuilder &B = Helper.MIRBuilder;
+  MachineRegisterInfo &MRI = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+B.buildBitcast(DstReg, LaneOpDstReg);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDstReg);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);

arsenm wrote:

That way you avoid adding extra intermediate bitcasts.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Vikram Hegde via cfe-commits

vikramRH wrote:

Added new 32-bit pointer and <8 x i16> tests.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-13 Thread Vikram Hegde via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+B.buildBitcast(DstReg, LaneOpDstReg);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDstReg);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);

vikramRH wrote:

Done, I hope as per the expectation; however, I don't understand the plus here.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH deleted 
https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+B.buildBitcast(DstReg, LaneOpDstReg);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDstReg);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);

vikramRH wrote:

Do you mean extract the s16 elements individually and handle them as in the (Size < 32) case?

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {

vikramRH wrote:

My bad, I will improve the helper.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Matt Arsenault via cfe-commits


@@ -2176,26 +2176,23 @@ def int_amdgcn_wave_reduce_umin : AMDGPUWaveReduce;
 def int_amdgcn_wave_reduce_umax : AMDGPUWaveReduce;
 
 def int_amdgcn_readfirstlane :
-  ClangBuiltin<"__builtin_amdgcn_readfirstlane">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty],
+  Intrinsic<[llvm_any_ty], [LLVMMatchType<0>],
 [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, 
IntrNoFree]>;
 
 // The lane argument must be uniform across the currently active threads of the
 // current wave. Otherwise, the result is undefined.
 def int_amdgcn_readlane :
-  ClangBuiltin<"__builtin_amdgcn_readlane">,
-  Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty],
+  Intrinsic<[llvm_any_ty], [LLVMMatchType<0>, llvm_i32_ty],
 [IntrNoMem, IntrConvergent, IntrWillReturn, IntrNoCallback, 
IntrNoFree]>;
 
 // The value to write and lane select arguments must be uniform across the
 // currently active threads of the current wave. Otherwise, the result is
 // undefined.
 def int_amdgcn_writelane :
-  ClangBuiltin<"__builtin_amdgcn_writelane">,
-  Intrinsic<[llvm_i32_ty], [
-llvm_i32_ty,// uniform value to write: returned by the selected lane
-llvm_i32_ty,// uniform lane select
-llvm_i32_ty // returned by all lanes other than the selected one
+  Intrinsic<[llvm_any_ty], [
+LLVMMatchType<0>,// uniform value to write: returned by the selected 
lane
+llvm_i32_ty,// uniform lane select
+LLVMMatchType<0> // returned by all lanes other than the selected one

arsenm wrote:

Comments are no longer aligned 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+B.buildBitcast(DstReg, LaneOpDstReg);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDstReg);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);

arsenm wrote:

For multiples of `<2 x s16>`, it's a bit nicer to preserve the 16-bit element types.
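
As an illustration of that idea (a sketch under assumptions: the `PieceTy` name and the readfirstlane-only loop are not from the patch), the unmerge can keep `<2 x s16>` pieces whenever the element type is s16:

```cpp
const LLT S32 = LLT::scalar(32);
const LLT V2S16 = LLT::fixed_vector(2, 16);
// Keep the 16-bit element type through the split instead of flattening to s32.
LLT PieceTy =
    (Ty.isVector() && Ty.getElementType() == LLT::scalar(16)) ? V2S16 : S32;

SmallVector<Register, 8> PartialRes;
auto Unmerge = B.buildUnmerge(PieceTy, Src0);
for (unsigned I = 0, E = Unmerge->getNumOperands() - 1; I != E; ++I) {
  Register Piece = Unmerge.getReg(I);
  // The underlying lane op still works on 32 bits, so cast each piece for it.
  Register PieceS32 =
      PieceTy == S32 ? Piece : B.buildBitcast(S32, Piece).getReg(0);
  Register Lane = B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
                      .addUse(PieceS32)
                      .getReg(0);
  PartialRes.push_back(
      PieceTy == S32 ? Lane : B.buildBitcast(PieceTy, Lane).getReg(0));
}
B.buildMergeLikeInstr(DstReg, PartialRes);
```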

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+B.buildBitcast(DstReg, LaneOpDstReg);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Register Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
+}
+
+Register LaneOpDstReg = LaneOpDst.getReg(0);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOpDstReg);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOpDstReg);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0Parts.getReg(i))
+ .addUse(Src1))
+.getReg(0));
+  break;
+}
+case Intrinsic::amdgcn_readfirstlane: {
+
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
+ .addUse(Src0Parts.getReg(i)))
+.getReg(0));
+
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  Register Src2 = MI.getOperand(4).getReg();
+  auto Src2Parts = B.buildUnmerge(S32, Src2);
+
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_writelane, {S32})
+ .addUse(Src0Parts.getReg(i))
+ .addUse(Src1)
+ .addUse(Src2Parts.getReg(i)))
+.getReg(0));
+}
+}
+
+if (Ty.isPointerVector()) {
+  auto MergedVec = B.buildMergeLikeInstr(
+  LLT::vector(ElementCount::getFixed(NumParts), S32), PartialRes);
+  B.buildBitcast(DstReg, MergedVec);

arsenm wrote:

You cannot bitcast from an i32 vector to a pointer. You can merge the i32 
pieces directly into the pointer though 
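
For the scalar-pointer case the suggestion amounts to something like this sketch (assuming `PartialRes` already holds the per-32-bit lane-op results; a vector-of-pointers result would still need its pointer elements reassembled first):

```cpp
// G_BITCAST from <N x s32> to a pointer type is not valid, but the s32 pieces
// can be merged straight into the pointer-typed destination register.
if (Ty.isPointer()) {
  B.buildMergeLikeInstr(DstReg, PartialRes);
  MI.eraseFromParent();
  return true;
}
```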

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);

arsenm wrote:

Missing break 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {

arsenm wrote:

This isn't quite what I meant; I meant that trying to handle all of these as if they were the same lane op was ugly.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits


@@ -493,8 +493,8 @@ Value *AMDGPUAtomicOptimizerImpl::buildScan(IRBuilder<> ,
 if (!ST->isWave32()) {
   // Combine lane 31 into lanes 32..63.
   V = B.CreateBitCast(V, IntNTy);
-  Value *const Lane31 = B.CreateIntrinsic(Intrinsic::amdgcn_readlane, {},
-  {V, B.getInt32(31)});
+  Value *const Lane31 = B.CreateIntrinsic(
+  Intrinsic::amdgcn_readlane, B.getInt32Ty(), {V, B.getInt32(31)});

jayfoad wrote:

Changes like this should disappear if we merge #91583 first.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {

vikramRH wrote:

I removed the helper in the recent commit following @arsenm's suggestion. The only reason is readability.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits

vikramRH wrote:

> > add f32 pattern to select read/writelane operations
> 
> Why would you need this? Don't you legalize f32 to i32?

Sorry about this. It's a leftover comment from the initial implementation which I should have removed.

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits

https://github.com/jayfoad edited 
https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);

jayfoad wrote:

I think you can just reuse Src0 instead of declaring a new Src0Valid. Same for 
Src2, and same for the SDAG code.
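
In other words, something along these lines (sketch only):

```cpp
// Reuse the existing registers instead of introducing Src0Valid/Src2Valid.
Src0 = B.buildBitcast(S32, Src0).getReg(0);
if (Src2.isValid())
  Src2 = B.buildBitcast(S32, Src2).getReg(0);
Register LaneOp = createLaneOp(Src0, Src1, Src2);
B.buildBitcast(DstReg, LaneOp);
```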

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register , Register ,
+  Register ) -> Register {
+auto LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+if (Src2.isValid())
+  return (LaneOpDst.addUse(Src1).addUse(Src2)).getReg(0);
+if (Src1.isValid())
+  return (LaneOpDst.addUse(Src1)).getReg(0);
+return LaneOpDst.getReg(0);
+  };
+
+  Register Src1, Src2, Src0Valid, Src2Valid;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+if (Src2.isValid())
+  Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+B.buildBitcast(DstReg, LaneOp);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOp);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOp);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0Parts.getReg(i))
+ .addUse(Src1))
+.getReg(0));

jayfoad wrote:

Yes, separate patch

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits


@@ -5386,6 +5386,153 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  Register Src1, Src2;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+MachineInstrBuilder LaneOpDst;
+switch (IID) {

jayfoad wrote:

Can you use a `createLaneOp` helper to build the intrinsic, like you do in the 
SDAG path?
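
One possible shape for such a helper on the GlobalISel side, mirroring the SDAG lambda (a sketch; later revisions of the patch adopt roughly this form):

```cpp
auto createLaneOp = [&](Register Src0, Register Src1,
                        Register Src2) -> Register {
  auto LaneOp = B.buildIntrinsic(IID, {S32}).addUse(Src0);
  switch (IID) {
  case Intrinsic::amdgcn_readfirstlane:
    return LaneOp.getReg(0);
  case Intrinsic::amdgcn_readlane:
    return LaneOp.addUse(Src1).getReg(0);
  case Intrinsic::amdgcn_writelane:
    return LaneOp.addUse(Src1).addUse(Src2).getReg(0);
  default:
    llvm_unreachable("unhandled lane op");
  }
};
```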

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Jay Foad via cfe-commits

https://github.com/jayfoad commented:

LGTM overall.

> add f32 pattern to select read/writelane operations

Why would you need this? Don't you legalize f32 to i32?
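
The legalization that makes a dedicated f32 pattern unnecessary is just a bitcast round trip through i32, roughly as follows (an illustrative sketch in the GlobalISel legalizer's terms):

```cpp
// f32 (or any other 32-bit type) is funneled through i32, so only the i32
// form of the lane op ever needs to be selected.
Register AsInt = B.buildBitcast(S32, Src0).getReg(0);
Register Lane = B.buildIntrinsic(Intrinsic::amdgcn_readfirstlane, {S32})
                    .addUse(AsInt)
                    .getReg(0);
B.buildBitcast(DstReg, Lane); // back to the original f32 type
```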

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread Vikram Hegde via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register , Register ,
+  Register ) -> Register {
+auto LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+if (Src2.isValid())
+  return (LaneOpDst.addUse(Src1).addUse(Src2)).getReg(0);
+if (Src1.isValid())
+  return (LaneOpDst.addUse(Src1)).getReg(0);
+return LaneOpDst.getReg(0);
+  };
+
+  Register Src1, Src2, Src0Valid, Src2Valid;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+if (Src2.isValid())
+  Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+B.buildBitcast(DstReg, LaneOp);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOp);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOp);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0Parts.getReg(i))
+ .addUse(Src1))
+.getReg(0));

vikramRH wrote:

Should this be a separate change that addresses other such instances too?

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-09 Thread via cfe-commits

github-actions[bot] wrote:




:warning: C/C++ code formatter, clang-format found issues in your code. 
:warning:



You can test this locally with the following command:


```bash
git-clang-format --diff 3d56ea05b6c746a7144f643bef2ebd599f605b8b d0610c47c0e2cb4bca5f90c289ffaa5c4178547f -- clang/lib/CodeGen/CGBuiltin.cpp llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.h llvm/lib/Target/AMDGPU/SIISelLowering.cpp
```





View the diff from clang-format here.


```diff
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp 
b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index 16d3219a16..b0a2bbeb61 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -5415,18 +5415,21 @@ bool 
AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
 Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
 MachineInstrBuilder LaneOpDst;
 switch (IID) {
-  case Intrinsic::amdgcn_readfirstlane: {
-LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
-break;
-  }
-  case Intrinsic::amdgcn_readlane: {
-LaneOpDst = B.buildIntrinsic(IID, 
{S32}).addUse(Src0Valid).addUse(Src1);
-break;
-  }
-  case Intrinsic::amdgcn_writelane: {
-Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
-LaneOpDst = B.buildIntrinsic(IID, 
{S32}).addUse(Src0Valid).addUse(Src1).addUse(Src2Valid);
-  }
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
 }
 
 Register LaneOpDstReg = LaneOpDst.getReg(0);
@@ -5443,21 +5446,25 @@ bool 
AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
 
 MachineInstrBuilder LaneOpDst;
 switch (IID) {
-  case Intrinsic::amdgcn_readfirstlane: {
-LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
-break;
-  }
-  case Intrinsic::amdgcn_readlane: {
-LaneOpDst = B.buildIntrinsic(IID, 
{S32}).addUse(Src0Valid).addUse(Src1);
-break;
-  }
-  case Intrinsic::amdgcn_writelane: {
-Register Src2Cast = MRI.getType(Src2).isScalar()
-? Src2
-: B.buildBitcast(LLT::scalar(Size), 
Src2).getReg(0);
-Register Src2Valid = B.buildAnyExt(LLT::scalar(32), 
Src2Cast).getReg(0);
-LaneOpDst = B.buildIntrinsic(IID, 
{S32}).addUse(Src0Valid).addUse(Src1).addUse(Src2Valid);
-  }
+case Intrinsic::amdgcn_readfirstlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid);
+  break;
+}
+case Intrinsic::amdgcn_readlane: {
+  LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0Valid).addUse(Src1);
+  break;
+}
+case Intrinsic::amdgcn_writelane: {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Register Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+  LaneOpDst = B.buildIntrinsic(IID, {S32})
+  .addUse(Src0Valid)
+  .addUse(Src1)
+  .addUse(Src2Valid);
+}
 }
 
 Register LaneOpDstReg = LaneOpDst.getReg(0);

```




https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -504,3 +508,16 @@ def AMDGPUdiv_fmas : PatFrags<(ops node:$src0, node:$src1, 
node:$src2, node:$vcc
 def AMDGPUperm : PatFrags<(ops node:$src0, node:$src1, node:$src2),
   [(int_amdgcn_perm node:$src0, node:$src1, node:$src2),
(AMDGPUperm_impl node:$src0, node:$src1, node:$src2)]>;
+
+def AMDGPUreadlane : PatFrags<(ops node:$src0, node:$src1),
+  [(int_amdgcn_readlane node:$src0, node:$src1),
+   (AMDGPUreadlane_impl node:$src0, node:$src1)]>;
+
+def AMDGPUreadfirstlane : PatFrags<(ops node:$src),
+  [(int_amdgcn_readfirstlane node:$src),
+   (AMDGPUreadfirstlane_impl node:$src)]>;
+
+def AMDGPUwritelane : PatFrags<(ops node:$src0, node:$src1, node:$src2),
+  [(int_amdgcn_writelane node:$src0, node:$src1, node:$src2),
+   (AMDGPUwritelane_impl node:$src0, node:$src1, node:$src2)]>;
+   

arsenm wrote:

Missing newline at end of file 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register , Register ,

arsenm wrote:

I think this helper is just making things more confusing. You can just handle 
the 3 cases separately with unmerge logic 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register , Register ,
+  Register ) -> Register {
+auto LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+if (Src2.isValid())
+  return (LaneOpDst.addUse(Src1).addUse(Src2)).getReg(0);
+if (Src1.isValid())
+  return (LaneOpDst.addUse(Src1)).getReg(0);
+return LaneOpDst.getReg(0);
+  };
+
+  Register Src1, Src2, Src0Valid, Src2Valid;
+  if (IID == Intrinsic::amdgcn_readlane || IID == Intrinsic::amdgcn_writelane) 
{
+Src1 = MI.getOperand(3).getReg();
+if (IID == Intrinsic::amdgcn_writelane) {
+  Src2 = MI.getOperand(4).getReg();
+}
+  }
+
+  LLT Ty = MRI.getType(DstReg);
+  unsigned Size = Ty.getSizeInBits();
+
+  if (Size == 32) {
+if (Ty.isScalar())
+  // Already legal
+  return true;
+
+Register Src0Valid = B.buildBitcast(S32, Src0).getReg(0);
+if (Src2.isValid())
+  Src2Valid = B.buildBitcast(S32, Src2).getReg(0);
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+B.buildBitcast(DstReg, LaneOp);
+MI.eraseFromParent();
+return true;
+  }
+
+  if (Size < 32) {
+Register Src0Cast = MRI.getType(Src0).isScalar()
+? Src0
+: B.buildBitcast(LLT::scalar(Size), 
Src0).getReg(0);
+Src0Valid = B.buildAnyExt(S32, Src0Cast).getReg(0);
+
+if (Src2.isValid()) {
+  Register Src2Cast =
+  MRI.getType(Src2).isScalar()
+  ? Src2
+  : B.buildBitcast(LLT::scalar(Size), Src2).getReg(0);
+  Src2Valid = B.buildAnyExt(LLT::scalar(32), Src2Cast).getReg(0);
+}
+Register LaneOp = createLaneOp(Src0Valid, Src1, Src2Valid);
+if (Ty.isScalar())
+  B.buildTrunc(DstReg, LaneOp);
+else {
+  auto Trunc = B.buildTrunc(LLT::scalar(Size), LaneOp);
+  B.buildBitcast(DstReg, Trunc);
+}
+
+MI.eraseFromParent();
+return true;
+  }
+
+  if ((Size % 32) == 0) {
+SmallVector PartialRes;
+unsigned NumParts = Size / 32;
+auto Src0Parts = B.buildUnmerge(S32, Src0);
+
+switch (IID) {
+case Intrinsic::amdgcn_readlane: {
+  Register Src1 = MI.getOperand(3).getReg();
+  for (unsigned i = 0; i < NumParts; ++i)
+PartialRes.push_back(
+(B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32})
+ .addUse(Src0Parts.getReg(i))
+ .addUse(Src1))
+.getReg(0));

arsenm wrote:

We should really add a buildIntrinsic overload that just takes the array of 
inputs like for other instructions 
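
Illustrative only: a hypothetical overload of the kind being suggested. It does not exist in MachineIRBuilder at the time of this review, so the signature below is an assumption, not an existing API:

```cpp
// Hypothetical: accept all source operands up front, like other build* helpers.
MachineInstrBuilder buildIntrinsic(Intrinsic::ID ID, ArrayRef<DstOp> Res,
                                   ArrayRef<SrcOp> SrcOps);

// Call sites could then drop the addUse() chains, e.g.:
//   B.buildIntrinsic(Intrinsic::amdgcn_readlane, {S32},
//                    {Src0Parts.getReg(i), Src1});
```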

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -5982,6 +5982,68 @@ static SDValue lowerBALLOTIntrinsic(const 
SITargetLowering , SDNode *N,
   DAG.getConstant(0, SL, MVT::i32), DAG.getCondCode(ISD::SETNE));
 }
 
+static SDValue lowerLaneOp(const SITargetLowering , SDNode *N,
+   SelectionDAG ) {
+  EVT VT = N->getValueType(0);
+  unsigned ValSize = VT.getSizeInBits();
+  unsigned IntrinsicID = N->getConstantOperandVal(0);
+  SDValue Src0 = N->getOperand(1);
+  SDLoc SL(N);
+  MVT IntVT = MVT::getIntegerVT(ValSize);
+
+  auto createLaneOp = [&](SDValue &Src0, SDValue &Src1, SDValue &Src2,

arsenm wrote:

SDValue should be passed by value 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register &Src0, Register &Src1,
+  Register &Src2) -> Register {

arsenm wrote:

Register should be passed by value 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-07 Thread Matt Arsenault via cfe-commits


@@ -5386,6 +5386,130 @@ bool 
AMDGPULegalizerInfo::legalizeDSAtomicFPIntrinsic(LegalizerHelper ,
   return true;
 }
 
+bool AMDGPULegalizerInfo::legalizeLaneOp(LegalizerHelper ,
+ MachineInstr ,
+ Intrinsic::ID IID) const {
+
+  MachineIRBuilder  = Helper.MIRBuilder;
+  MachineRegisterInfo  = *B.getMRI();
+
+  Register DstReg = MI.getOperand(0).getReg();
+  Register Src0 = MI.getOperand(2).getReg();
+
+  auto createLaneOp = [&](Register , Register ,
+  Register ) -> Register {
+auto LaneOpDst = B.buildIntrinsic(IID, {S32}).addUse(Src0);
+if (Src2.isValid())
+  return (LaneOpDst.addUse(Src1).addUse(Src2)).getReg(0);
+if (Src1.isValid())
+  return (LaneOpDst.addUse(Src1)).getReg(0);

arsenm wrote:

Extra parentheses around this 

https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [AMDGPU][WIP] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (PR #89217)

2024-05-05 Thread Vikram Hegde via cfe-commits

https://github.com/vikramRH edited 
https://github.com/llvm/llvm-project/pull/89217
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits