nolanliou opened a new issue #6354:
URL: https://github.com/apache/incubator-tvm/issues/6354


   I compared two similar BERT models running on CPU with TVM: one imported from PyTorch, the other from MXNet. Because of the large performance difference, I did some profiling. The results show that the run time of the same operation (matmul) with the same workload varies significantly.
   
   ENV:
   1. TVM: built with MKL.
   2. Intel CPU
   3. OpenMP: `KMP_AFFINITY=compact,1,0 OMP_NUM_THREADS=24` (set before loading TVM, see the sketch below)
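   
   Roughly how the OpenMP environment above is applied (a minimal sketch, not the exact script; the variables have to be set before the OpenMP/MKL runtime initializes):
   ```
   import os
   
   # Hypothetical reproduction of the environment above: set these before
   # importing tvm so the OpenMP/MKL runtime picks them up.
   os.environ["KMP_AFFINITY"] = "compact,1,0"
   os.environ["OMP_NUM_THREADS"] = "24"
   
   import tvm  # imported only after the thread environment is configured
   ```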
   
   Model inference time:
   ```
   # mxnet model
   TVM Mean inference time: 5.53 ms
   # pytorch model
   TVM Mean inference time: 23.05 ms
   ```
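   
   The mean inference times were collected roughly like this (a simplified sketch; `graph`, `lib`, `params`, `input_name`, and `data` are placeholders for the actual build artifacts and input):
   ```
   import numpy as np
   import tvm
   from tvm.contrib import graph_runtime
   
   # Hypothetical names: `graph`, `lib`, `params` come from relay.build(...),
   # `input_name` / `data` are the BERT input used for the numbers above.
   ctx = tvm.cpu(0)
   m = graph_runtime.create(graph, lib, ctx)
   m.set_input(**params)
   m.set_input(input_name, data)
   
   # time_evaluator runs the whole graph repeatedly and reports the mean latency.
   ftimer = m.module.time_evaluator("run", ctx, number=10, repeat=30)
   print("TVM Mean inference time: %.2f ms" % (np.mean(ftimer().results) * 1000))
   ```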
   
   Profiling result:
   ```
   # MXNet model
   Node Name                Ops.                   Time(us)   Time(%)   Shape       Inputs   Outputs
   ---------
   fused_nn_dense_add_15    fused_nn_dense_add_1   308.926    5.58      (32, 768)   3        1
   fused_nn_dense_add_11    fused_nn_dense_add_1   307.277    5.551     (32, 768)   3        1
   
   # PyTorch model
   Node Name                Ops.                   Time(us)   Time(%)   Shape       Inputs   Outputs
   ---------
   fused_nn_dense_add_3     fused_nn_dense_add_3   1783.75    7.631     (32, 768)   3        1
   fused_nn_dense_add_31    fused_nn_dense_add_3   1593.08    6.815     (32, 768)   3        1
   ```
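   
   The per-operator breakdown comes from the graph debug runtime; a simplified sketch of how such a table can be produced (same placeholder names as in the timing sketch above):
   ```
   import tvm
   from tvm.contrib.debugger import debug_runtime
   
   # Hypothetical: `graph`, `lib`, `params`, `input_name`, `data` as above.
   ctx = tvm.cpu(0)
   m = debug_runtime.create(graph, lib, ctx)
   m.set_input(**params)
   m.set_input(input_name, data)
   m.run()  # prints the per-node table (Node Name / Ops / Time(us) / Time(%) / Shape / ...)
   ```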
   
   IR code (identical for the PyTorch and MXNet models):
   ```
   attr [0] "compute_scope" = "fused_nn_dense_add_3_compute_";
   attr [C: handle] "storage_scope" = "global";
   allocate(C, float32, [24576]) {
     attr [0] "extern_scope" = 0;
     @tir.tvm_call_packed("tvm.contrib.cblas.matmul",
       @tir.tvm_stack_make_array(placeholder, @tir.tvm_stack_make_shape(32, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle),
       @tir.tvm_stack_make_array(placeholder_1, @tir.tvm_stack_make_shape(768, 3072, dtype=handle), 0, 2, 0f32, 0, dtype=handle),
       @tir.tvm_stack_make_array(C, @tir.tvm_stack_make_shape(32, 768, dtype=handle), 0, 2, 0f32, 0, dtype=handle),
       False, True, dtype=int32)
     for (ax0: int32, 0, 32) "parallel" {
       for (ax1: int32, 0, 768) {
         T_add[((ax0*768) + ax1)] = ((float32*)C[((ax0*768) + ax1)] + (float32*)placeholder_2[ax1])
       }
     }
   }
   ```
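   
   The `tvm.contrib.cblas.matmul` packed call shows that `nn.dense` is offloaded to MKL's CBLAS in both builds. That comes from enabling the BLAS library in the target, roughly like this (a sketch; `mod` and `params` stand for the Relay model imported via `relay.frontend.from_pytorch` / `relay.frontend.from_mxnet`):
   ```
   import tvm
   from tvm import relay
   
   # Hypothetical: `mod`, `params` are the imported BERT model.
   # "-libs=cblas" routes dense/matmul to the BLAS library TVM was built
   # with (MKL in this case).
   target = "llvm -libs=cblas"
   with tvm.transform.PassContext(opt_level=3):
       graph, lib, params = relay.build(mod, target=target, params=params)
   ```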
   
   However, when setting `OMP_NUM_THREADS=1`, the two models' inference times are the same, so this looks like a problem with multiple threads.
   
   What could cause the difference?

