mbrookhart opened a new pull request #7123:
URL: https://github.com/apache/tvm/pull/7123


   As a follow-up to #6839, this parallelizes the cumsum in get_valid_counts 
using an upsweep/downsweep tree-based prefix sum algorithm, similar to what I 
did in #7099.
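
For readers unfamiliar with the technique, here is a minimal sketch of an upsweep/downsweep (Blelloch-style) exclusive prefix sum in plain Python. It is not the TIR code from this PR: the function names `exclusive_scan_tree` and `compact_valid` are hypothetical, the input is padded to a power of two for simplicity, and each inner `for` loop stands in for work that one GPU thread per index could do in parallel.

```python
def exclusive_scan_tree(values):
    """Exclusive prefix sum using an upsweep (reduce) and a downsweep phase."""
    n = len(values)
    if n == 0:
        return []

    # Pad to the next power of two so the implicit binary tree is complete.
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)

    # Upsweep: build partial sums bottom-up. Iterations of the inner loop
    # touch disjoint elements, so they are independent of each other.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] += tree[i - stride]
        stride *= 2

    # Downsweep: clear the root, then push prefix sums back down the tree.
    tree[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] += left
        stride //= 2

    return tree[:n]


def compact_valid(items, flags):
    """Hypothetical helper: keep items whose flag is 1, using the scan
    result as each kept item's output position."""
    positions = exclusive_scan_tree(flags)
    count = positions[-1] + flags[-1] if flags else 0
    out = [None] * count
    for item, flag, pos in zip(items, flags, positions):
        if flag:
            out[pos] = item
    return out
```

The scan takes on the order of log2(n) parallel steps per phase instead of n sequential additions. In a compaction such as the one above, the scan of the validity flags gives each valid element its destination index.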
   
   On my 1070 Ti, testing deploy_ssd_gluoncv.py, I previously reported that 
get_valid_counts took 3674.62 microseconds. This PR reduces that to 495.8 
microseconds, roughly a 7.4x speedup.
   
   @masahi has expressed interest in implementing a more general prefix scan 
for other ops. As future work, I expect we'll refactor this and possibly add 
cache optimizations.
   
   Thanks
   
   cc @Laurawly @zhiics @kevinthesun 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

