Laurawly edited a comment on pull request #6839:
URL: https://github.com/apache/tvm/pull/6839#issuecomment-742728283


   > @Laurawly I took the mxnet example you provided and ran it with the debug
   > runtime. It required a little editing because the APIs have changed
   > slightly since that tutorial was written. Anyway, this is what I get on my
   > 1070 TI with Thrust enabled.
   > 
   > main:
   > 
   > ```
   > Ops                                Time(us)    Time(%)  Shape
   > ---                                --------    -------  -----
   > fused_vision_non_max_suppression   139329.0    74.66    (1, 122640, 6)
   > fused_vision_get_valid_counts      124.255     0.067    (1, 122640, 6)
   > ```
   > 
   > this PR:
   > 
   > ```
   > fused_vision_get_valid_counts      46138.3     50.891   (1, 122640, 6)
   > fused_vision_non_max_suppression   12319.8     13.589   (1, 122640, 6)
   > ```
   > 
   > The get valid counts function slows down, but I'm actually seeing the total
   > runtime of these ops decrease from 139.3ms to 58.5ms.
   > 
   > My modifications to the example can be found here:
   > https://gist.github.com/mbrookhart/df25427cbbfb3c73ed16be72c8525610
   
   The measured time is not as fast as before because of this PR:
   https://github.com/apache/tvm/pull/7005. If you revert that one, you should
   get fairly good performance without any effect on correctness. After my
   improvement, NMS was no longer a bottleneck of the SSD model, whereas in your
   measurement it still seems to be. So my point is that your changes to the
   code from PR #7005 here:
   https://github.com/apache/tvm/blob/f332512629e94fd8d761f3dd002fd3aa91673ce0/python/tvm/topi/cuda/nms.py#L470-L483
   and here:
   https://github.com/apache/tvm/blob/f332512629e94fd8d761f3dd002fd3aa91673ce0/python/tvm/topi/cuda/nms.py#L632-L643
   should be faster and thus account for the performance improvement.
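
For reference, the combined runtime of the two ops in each profile can be checked by summing the per-op times from the quoted debug-runtime output. This is only a minimal sketch in Python: the numbers are copied from the tables above, while the dictionary and function names are hypothetical and not part of TVM.

```python
# Per-op times in microseconds, copied from the quoted debug-runtime output.
# The names main_us, pr_us and total_ms are illustrative, not TVM APIs.
main_us = {
    "fused_vision_non_max_suppression": 139329.0,
    "fused_vision_get_valid_counts": 124.255,
}
pr_us = {
    "fused_vision_get_valid_counts": 46138.3,
    "fused_vision_non_max_suppression": 12319.8,
}


def total_ms(profile):
    """Sum the per-op times (us) of one profile and convert to milliseconds."""
    return sum(profile.values()) / 1000.0


main_total = total_ms(main_us)
pr_total = total_ms(pr_us)
speedup = main_total / pr_total

print(f"main: {main_total:.1f} ms, this PR: {pr_total:.1f} ms, "
      f"speedup: {speedup:.2f}x")
```

With these inputs, the quoted 139.3ms figure matches the NMS op alone on main. Including get_valid_counts brings the main total to roughly 139.5ms, against roughly 58.5ms for the PR.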



