dsbarinov1 commented on code in PR #14209:
URL: https://github.com/apache/tvm/pull/14209#discussion_r1143435388


##########
python/tvm/tir/tensor_intrin/arm_cpu.py:
##########
@@ -131,8 +163,68 @@ def dot_product_4x4_i8i8i32_sdot(
         )
 
 
[email protected]_func
+def dot_product_4x4_u8u8u32_udot(
+    A: T.Buffer((4,), "uint8", offset_factor=1),
+    B: T.Buffer((4, 4), "uint8", offset_factor=1),
+    C: T.Buffer((4,), "uint32", offset_factor=1),
+) -> None:
+    with T.block("root"):
+        T.reads(C[0:4], A[0:4], B[0:4, 0:4])
+        T.writes(C[0:4])
+
+        A_i8x4 = A.vload([0], "uint8x4")
+        A_i32 = T.reinterpret(A_i8x4, dtype="uint32")
+        vec_ai32 = T.broadcast(A_i32, 4)
+        vec_a = T.reinterpret(vec_ai32, dtype="uint8x16")
+
+        vec_b = B.vload([0, 0], dtype="uint8x16")
+
+        vec_c = C.vload([0], dtype="uint32x4")
+
+        C[T.ramp(T.int32(0), 1, 4)] = T.call_llvm_pure_intrin(
+            T.llvm_lookup_intrinsic_id("llvm.aarch64.neon.udot.v4u32.v16u8"),
+            T.uint32(3),
+            vec_c,
+            vec_a,
+            vec_b,
+            dtype="uint32x4",
+        )
+
+
[email protected]_func
+def dot_product_4x4_u8u8i32_hdot(
+    A: T.Buffer((4,), "uint8", offset_factor=1),
+    B: T.Buffer((4, 4), "uint8", offset_factor=1),
+    C: T.Buffer((4,), "int32", offset_factor=1),
+) -> None:
+    with T.block("root"):
+        T.reads(C[0:4], A[0:4], B[0:4, 0:4])
+        T.writes(C[0:4])
+
+        A_i8x4 = A.vload([0], "uint8x4")
+        A_i32 = T.reinterpret(A_i8x4, dtype="uint32")
+        vec_ai32 = T.broadcast(A_i32, 4)
+        vec_a = T.reinterpret(vec_ai32, dtype="uint8x16")
+
+        vec_b = B.vload([0, 0], dtype="uint8x16")
+
+        vec_c = C.vload([0], dtype="int32x4")
+
+        C[T.ramp(T.int32(0), 1, 4)] = T.call_llvm_pure_intrin(
+            T.llvm_lookup_intrinsic_id("llvm.aarch64.neon.udot.v4u32.v16u8"),

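For readers unfamiliar with the `udot` lane arithmetic, the semantics of the 4x4 intrinsic above can be sketched in plain NumPy (a reference model only, not TVM code; the helper name is mine). Because `vec_a` broadcasts the 4 bytes of `A` across all four 32-bit lanes, each output lane computes `C[i] += dot(A, B[i, :])`:

```python
import numpy as np

def dot_product_4x4_u8u8u32_ref(A, B, C):
    """Reference semantics of the udot tensor intrinsic: C[i] += dot(A, B[i, :])."""
    A32 = A.astype(np.uint32)                 # 4 uint8 activations, widened
    B32 = B.astype(np.uint32)                 # 4x4 uint8 weights, widened
    C += (B32 * A32[np.newaxis, :]).sum(axis=1).astype(np.uint32)
    return C

A = np.array([1, 2, 3, 4], dtype=np.uint8)
B = np.arange(16, dtype=np.uint8).reshape(4, 4)
C = np.zeros(4, dtype=np.uint32)
dot_product_4x4_u8u8u32_ref(A, B, C)   # C becomes [20, 60, 100, 140]
```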
Review Comment:
   When experimenting with the **tflite_mobilenet_v3_quant** model, we encountered 
a convolution that multiplies uint8 × uint8 tensors into an int32 accumulator, 
which prevented us from applying the existing _sdot/udot_ intrinsics, so we 
created a new _hdot_ intrinsic that handles this dtype layout. To my knowledge, 
there is no NEON instruction for the u8u8i32 layout, so we call the closest 
instruction (the unsigned `udot`) instead; the intrinsic was successfully 
applied to that type of convolution and brought a performance benefit.
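   The reason reusing the unsigned `udot` for an int32 accumulator is sound can be sketched in plain NumPy (my illustration, not part of the PR): u8×u8 products are non-negative, and two's-complement addition produces the same bit pattern whether the 32-bit accumulator lanes are read as u32 or i32, so reinterpreting the u32 result as i32 yields the correct signed sum:

```python
import numpy as np

def i32_dot_via_u32(A, B, C_i32):
    """Accumulate u8*u8 dot products into an int32 accumulator using
    unsigned (udot-style) arithmetic, then reinterpret the bits."""
    C_u32 = C_i32.view(np.uint32).copy()       # reinterpret i32 bits as u32
    prods = (B.astype(np.uint32)
             * A.astype(np.uint32)[np.newaxis, :]).sum(axis=1).astype(np.uint32)
    C_u32 += prods                             # unsigned add, wraps mod 2**32
    return C_u32.view(np.int32)                # reinterpret result back as i32

A = np.array([200, 201, 202, 203], dtype=np.uint8)
B = np.full((4, 4), 255, dtype=np.uint8)
C = np.array([-5, -6, -7, -8], dtype=np.int32)

# Reference: widen everything and accumulate in signed arithmetic.
ref = C + (B.astype(np.int64) * A.astype(np.int64)).sum(axis=1).astype(np.int32)
out = i32_dot_via_u32(A, B, C)
assert (out == ref).all()
```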



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
