masahi commented on pull request #9737:
URL: https://github.com/apache/tvm/pull/9737#issuecomment-993909962


   > @masahi Maybe the reason why the TVM script takes so long is that you are 
doing 100 iterations per benchmark where as the cutlass script is only doing 20?
   
   I believe `cutlass_profiler` is also doing 100 iterations by default: 
https://github.com/NVIDIA/cutlass/blob/808c25337a3ed4c97ac21895257b1addc72d6ca8/tools/profiler/src/options.cu#L386
   
   >  Also the TVM script is running through the whole tvm compilation pipeline 
for each workload.
   
   As I commented in 
https://github.com/apache/tvm/pull/9737#discussion_r768968554, we don't invoke 
the tvm pipeline when we select cutlass kernels. 
   
   One major difference between the two scripts is that cutlass compiles all kernels 
into one giant profiler executable, while we generate a separate executable for 
each kernel. So cutlass can allocate / deallocate memory once and loop through 
each kernel for a given workload to select the best one. Also, I remember that 
there is a non-trivial initialization cost (close to 1 sec) for any CUDA app 
when we invoke the first CUDA API call - for `cutlass_profiler` this happens 
only once, while we pay that cost for each profiler binary (there are about 60 of 
them). 
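   As a back-of-the-envelope check, the fixed startup cost alone adds up quickly when it is paid per binary. The sketch below just does the arithmetic using the rough figures mentioned above (about 60 binaries, roughly 1 second of initialization each); these are estimates, not measurements:

```python
# Rough estimate of CUDA context-init overhead: per-binary vs. amortized.
# The numbers are the approximate figures from the discussion, not measurements.
NUM_PROFILER_BINARIES = 60   # one executable per candidate kernel
INIT_COST_SEC = 1.0          # approximate cost of the first CUDA API call

# Separate binaries: every process pays the init cost once.
per_binary_overhead = NUM_PROFILER_BINARIES * INIT_COST_SEC

# Single combined executable (cutlass_profiler style): paid exactly once.
combined_overhead = INIT_COST_SEC

print(per_binary_overhead)  # 60.0 -> about a minute of pure startup overhead
print(combined_overhead)    # 1.0
```

   So startup cost alone could account for on the order of a minute, though as noted above that still doesn't explain the full gap.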
   
   But this still doesn't explain the 10x difference, so I believe there is 
something else going on. We could adopt the same approach as `cutlass_profiler` 
and compile all candidate kernels into one executable. I didn't do that for 
`conv2d_profiler` because I just followed how `gemm_profiler` is implemented, 
but that could be a possible improvement. 
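   For illustration, the single-executable approach amounts to something like the loop below: allocate buffers once, time every candidate against them, and keep the fastest. This is a hypothetical Python sketch with plain callables standing in for compiled CUTLASS kernels; `select_best_kernel` and the toy "kernels" are made up for this example, not actual TVM or CUTLASS APIs.

```python
import time

def select_best_kernel(kernels, buffers, iterations=100):
    """Time each candidate kernel against shared, pre-allocated buffers
    and return (name, avg_seconds) of the fastest.

    `kernels` maps a kernel name to a callable; in a real combined
    profiler these would be compiled kernel instantiations, and the
    buffers would be device allocations made once up front."""
    best_name, best_time = None, float("inf")
    for name, run_kernel in kernels.items():
        start = time.perf_counter()
        for _ in range(iterations):
            run_kernel(buffers)
        elapsed = (time.perf_counter() - start) / iterations
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time

# Toy usage: two fake "kernels" standing in for compiled candidates.
if __name__ == "__main__":
    buffers = [0.0] * 1024            # allocated once, shared by all candidates
    slow = lambda buf: sum(buf) * 2   # pretend these launch GPU kernels
    fast = lambda buf: None
    name, _ = select_best_kernel({"slow": slow, "fast": fast}, buffers,
                                 iterations=10)
    print(name)
```

   The key point is structural: allocation, CUDA initialization, and process startup all happen once, outside the per-kernel loop, instead of once per candidate.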

