[PATCH] D97857: [Matrix] Add support for matrix-by-scalar division.

2021-03-03 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added inline comments.



Comment at: clang/test/CodeGen/matrix-type-operators.c:303
 }
 
+// CHECK-LABEL: @divide_double_matrix_scalar_float(

Shouldn't a test for half floating point be added here as well? 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D97857/new/

https://reviews.llvm.org/D97857

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-31 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added a comment.

@fhahn When I mentioned the splats I was talking about the IR, not the final 
code. In the Godbolt links you sent, it's the same as what I see. However, take a 
look at the IR your example generates:

  %vec.cast = bitcast [4 x float]* %A to <2 x float>*
  %col.load = load <2 x float>, <2 x float>* %vec.cast, align 4
  %vec.gep = getelementptr [4 x float], [4 x float]* %A, i64 0, i64 2
  %vec.cast2 = bitcast float* %vec.gep to <2 x float>*
  %col.load3 = load <2 x float>, <2 x float>* %vec.cast2, align 4
  %vec.cast4 = bitcast [4 x float]* %B to <2 x float>*
  %col.load5 = load <2 x float>, <2 x float>* %vec.cast4, align 4
  %vec.gep6 = getelementptr [4 x float], [4 x float]* %B, i64 0, i64 2
  %vec.cast7 = bitcast float* %vec.gep6 to <2 x float>*
  %col.load8 = load <2 x float>, <2 x float>* %vec.cast7, align 4
  %splat.splat = shufflevector <2 x float> %col.load5, <2 x float> poison, <2 x i32> zeroinitializer
  %0 = fmul <2 x float> %col.load, %splat.splat
  %splat.splat11 = shufflevector <2 x float> %col.load5, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %1 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat11, <2 x float> %0)
  %splat.splat14 = shufflevector <2 x float> %col.load8, <2 x float> poison, <2 x i32> zeroinitializer
  %2 = fmul <2 x float> %col.load, %splat.splat14
  %splat.splat17 = shufflevector <2 x float> %col.load8, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %3 = call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %col.load3, <2 x float> %splat.splat17, <2 x float> %2)
  %vec.cast18 = bitcast [4 x float]* %C to <2 x float>*
  %col.load19 = load <2 x float>, <2 x float>* %vec.cast18, align 4
  %vec.gep20 = getelementptr [4 x float], [4 x float]* %C, i64 0, i64 2
  %vec.cast21 = bitcast float* %vec.gep20 to <2 x float>*
  %col.load22 = load <2 x float>, <2 x float>* %vec.cast21, align 4
  %4 = fadd <2 x float> %1, %col.load19
  %5 = fadd <2 x float> %3, %col.load22
  store <2 x float> %4, <2 x float>* %vec.cast18, align 4
  store <2 x float> %5, <2 x float>* %vec.cast21, align 4

I don't see a simple, reliable pattern to match the operands of %4 with %0, for 
example, and this is what I meant by the splat in the middle. The pragma 
approach assumes we're always targeting architectures where the better 
approach is to fuse the fmuls and fadds. The problem is that what you have to 
decide is whether or not to preload the accumulator. On IBM Power10's MMA this 
would be pretty far from optimal, for example, because you have specific 
instructions to load accumulators.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433



[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-31 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added a comment.

@fhahn Ok, I see what you mean now; this sounds like a doable path and might be 
able to cover architectures with specialized matrix multiplication instructions 
as well.

Just to see if I understand correctly: I can add a matrix_add intrinsic, do a 
traversal looking for matrix_multiply, and fuse both, changing 
`LowerMatrixMultiplyFused` to support pre-loading the accumulator. Is that 
correct?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433



[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-31 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added a comment.

In D99433#2662259 , @fhahn wrote:

> In D99433#2661919 , 
> @everton.constantino wrote:
>
>> @fhahn Ok, I see what you mean now; this sounds like a doable path and might 
>> be able to cover architectures with specialized matrix multiplication 
>> instructions as well.
>>
>> Just to see if I understand correctly: I can add a matrix_add intrinsic, do a 
>> traversal looking for matrix_multiply, and fuse both, changing 
>> `LowerMatrixMultiplyFused` to support pre-loading the accumulator. Is that 
>> correct?
>
> Yes that sounds like a good path forward! I think at the moment, adding a 
> matrix_mul_add intrinsic may be a bit premature, as we can just match & lower 
> directly in place, as we already do in `LowerMatrixMultiplyFused`. Once we 
> add more and more such transforms, it may really help to have additional 
> intrinsics (or we could just create our own dummy declarations which are just 
> used during the matrix lowering, to avoid adding too many intrinsics). But 
> for now I think we can move along faster without adding a new intrinsic.

Great, I'll update the patch then. Thanks for the comments!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433



[PATCH] D97857: [Matrix] Add support for matrix-by-scalar division.

2021-03-10 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added inline comments.



Comment at: clang/test/CodeGen/matrix-type-operators.c:303
 }
 
+// CHECK-LABEL: @divide_double_matrix_scalar_float(

fhahn wrote:
> everton.constantino wrote:
> > Shouldn't a test for half floating point be added here as well? 
> Thanks for taking a look! I don't think we have any tests for half floating 
> points at the moment. The tests do not cover each possible combination of 
> types I think, because they re-use the existing code-gen for the stuff that's 
> element type specific (except deciding whether to use floating point or 
> integer ops).
> 
> I think it would probably be good to have some FP16 tests though, but that 
> should be a separate change IMO.
Thanks for the reply. Ok, that makes sense; I was under the false impression 
that FP16 was already covered.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D97857/new/

https://reviews.llvm.org/D97857



[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-26 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino created this revision.
everton.constantino added reviewers: anemet, rjmccall, rsmith, Bigcheese, fhahn.
Herald added subscribers: dexonsmith, tschuett, hiraditya.
everton.constantino requested review of this revision.
Herald added subscribers: llvm-commits, cfe-commits, jdoerfert.
Herald added projects: clang, LLVM.

This patch creates a new builtin to support matrix multiply-add. Currently, when 
you do C = A*B + C you have the overhead of additional fadds. With this
builtin, the accumulators are loaded with the C matrix during the 
multiplication, considerably reducing the number of operations.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D99433

Files:
  clang/docs/MatrixTypes.rst
  clang/include/clang/Basic/Builtins.def
  clang/include/clang/Basic/DiagnosticSemaKinds.td
  clang/include/clang/Sema/Sema.h
  clang/lib/CodeGen/CGBuiltin.cpp
  clang/lib/Sema/SemaChecking.cpp
  clang/test/CodeGen/matrix-type-builtins.c
  clang/test/Sema/matrix-type-builtins.c
  llvm/include/llvm/IR/Intrinsics.td
  llvm/include/llvm/IR/MatrixBuilder.h
  llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp

Index: llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
===
--- llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
+++ llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
@@ -511,6 +511,7 @@
 if (II)
   switch (II->getIntrinsicID()) {
   case Intrinsic::matrix_multiply:
+  case Intrinsic::matrix_multiply_add:
   case Intrinsic::matrix_transpose:
   case Intrinsic::matrix_column_major_load:
   case Intrinsic::matrix_column_major_store:
@@ -540,6 +541,7 @@
 
   Value *MatrixA;
   Value *MatrixB;
+  Value *MatrixC;
   Value *M;
   Value *N;
   Value *K;
@@ -547,6 +549,11 @@
                       m_Value(MatrixA), m_Value(MatrixB), m_Value(M),
                       m_Value(N), m_Value(K)))) {
     Propagate = setShapeInfo(Inst, {M, K});
+  } else if (match(Inst,
+                   m_Intrinsic<Intrinsic::matrix_multiply_add>(
+                       m_Value(MatrixA), m_Value(MatrixB), m_Value(MatrixC),
+                       m_Value(M), m_Value(N), m_Value(K)))) {
+    Propagate = setShapeInfo(Inst, {M, K});
   } else if (match(Inst, m_Intrinsic<Intrinsic::matrix_transpose>(
                              m_Value(MatrixA), m_Value(M), m_Value(N)))) {
 // Flip dimensions.
@@ -611,6 +618,7 @@
 
   Value *MatrixA;
   Value *MatrixB;
+  Value *MatrixC;
   Value *M;
   Value *N;
   Value *K;
@@ -622,7 +630,18 @@
 
 if (setShapeInfo(MatrixB, {N, K}))
   pushInstruction(MatrixB, WorkList);
+  } else if (match(V,
+                   m_Intrinsic<Intrinsic::matrix_multiply_add>(
+                       m_Value(MatrixA), m_Value(MatrixB), m_Value(MatrixC),
+                       m_Value(M), m_Value(N), m_Value(K)))) {
+if (setShapeInfo(MatrixA, {M, N}))
+  pushInstruction(MatrixA, WorkList);
 
+if (setShapeInfo(MatrixB, {N, K}))
+  pushInstruction(MatrixB, WorkList);
+
+if (setShapeInfo(MatrixC, {M, K}))
+  pushInstruction(MatrixC, WorkList);
   } else if (match(V, m_Intrinsic<Intrinsic::matrix_transpose>(
                           m_Value(MatrixA), m_Value(M), m_Value(N)))) {
 // Flip dimensions.
@@ -673,6 +692,7 @@
 
   switch (II->getIntrinsicID()) {
   case Intrinsic::matrix_multiply:
+  case Intrinsic::matrix_multiply_add:
   case Intrinsic::matrix_transpose:
   case Intrinsic::matrix_column_major_load:
   case Intrinsic::matrix_column_major_store:
@@ -769,6 +789,9 @@
 case Intrinsic::matrix_column_major_store:
   LowerColumnMajorStore(Inst);
   break;
+case Intrinsic::matrix_multiply_add:
+  LowerMultiplyAdd(Inst);
+  break;
 default:
   return false;
 }
@@ -1009,11 +1032,13 @@
 }
   }
 
-  /// Compute \p Result += \p A * \p B for input matrices with left-associating
-  /// addition.
+  /// Compute \p Result += \p A * \p B + \p ACC for input matrices with
+  /// left-associating addition.
+  template 
   void emitMatrixMultiply(MatrixTy &Result, const MatrixTy &A,
   const MatrixTy &B, bool AllowContraction,
-  IRBuilder<> &Builder, bool isTiled) {
+  IRBuilder<> &Builder, bool isTiled,
+  const MatrixTy *ACC = nullptr) {
 const unsigned VF = std::max(
 TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
 .getFixedSize() /
@@ -1030,20 +1055,25 @@
 unsigned NumComputeOps = 0;
 if (A.isColumnMajor()) {
   // Multiply columns from the first operand with scalars from the second
-  // operand. Then move along the K axes and accumulate the columns.  With
+  // operand. Then move along the K axes and accumulate the columns. With
   // this the adds can be vectorized without reassociation.

[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-26 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added a comment.

@jdoerfert Which tests do you have in mind? I added one for SEMA and one for 
CodeGen.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433



[PATCH] D99433: [Matrix] Including __builtin_matrix_multiply_add for the matrix type extension.

2021-03-26 Thread Everton Constantino via Phabricator via cfe-commits
everton.constantino added a comment.

@fhahn That was my first idea; however, it's not as simple as it looks. I tried 
moving the adds, but splats make it considerably harder to find a pattern that 
catches this and fuses the multiplies, especially with bigger matrices. My real 
wish was to add a new IR instruction to handle matrices, because the 
MADD is but a simple example of other, more interesting optimizations that can 
be done, like using matrix associative properties to reduce the number of 
calculations. However, I found that path too complicated and opted for a 
compromise for the moment. I want to start writing some GEMM micro-kernels with 
this extension, and this builtin was the shortest path.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99433/new/

https://reviews.llvm.org/D99433
