mbrookhart opened a new pull request #7123: URL: https://github.com/apache/tvm/pull/7123
As a follow-up to #6839, this parallelizes the cumsum in get_valid_counts using an upsweep/downsweep tree-based prefix sum algorithm, similar to what I did in #7099.

On my 1070 Ti, testing deploy_ssd_gluoncv.py, I previously reported that get_valid_counts took 3674.62 microseconds; this change reduces that to 495.8 microseconds.

@masahi has expressed interest in implementing a more general prefix scan for other ops. As future work, I expect we'll refactor this and possibly do some cache optimization.

Thanks

cc @Laurawly @zhiics @kevinthesun
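For readers unfamiliar with the upsweep/downsweep approach mentioned above, here is a minimal sketch in plain Python. It is not the TVM implementation from this PR: it simulates the two tree passes sequentially, and the names `exclusive_scan` and `compact_valid_indices` are hypothetical helpers introduced only for illustration. The assumption is that the cumsum is applied to per-box validity flags so that each valid box learns its output slot.

```python
def exclusive_scan(values):
    """Exclusive prefix sum using an upsweep/downsweep (Blelloch-style) tree.

    Sequential simulation: on a GPU, every iteration of each inner `for`
    loop is independent and could be handled by a separate thread.
    """
    n = len(values)
    if n == 0:
        return []

    # Pad to the next power of two so the tree is complete.
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)

    # Upsweep (reduce): build partial sums up the tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Downsweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2

    return data[:n]


def compact_valid_indices(scores, score_threshold):
    """Hypothetical illustration: use the scan to pack valid indices.

    Returns (valid_count, indices of entries whose score exceeds the threshold).
    """
    flags = [1 if s > score_threshold else 0 for s in scores]
    offsets = exclusive_scan(flags)
    valid_count = offsets[-1] + flags[-1] if flags else 0

    out = [-1] * valid_count
    for i, flag in enumerate(flags):
        if flag:
            # offsets[i] is the number of valid entries before position i.
            out[offsets[i]] = i
    return valid_count, out


if __name__ == "__main__":
    print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
    print(compact_valid_indices([0.9, 0.1, 0.5, 0.05, 0.7], 0.3))
```

Both passes take O(log n) steps with O(n) total work, which is what makes the scan attractive compared with a single-threaded cumulative sum.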