guberti opened a new pull request, #12671: URL: https://github.com/apache/tvm/pull/12671
Currently, Relay QNN uses its `helper_no_fast_int8_hw_legalization` to change all `int8` operations in Cortex-M-compiled models into `int16` operations. I believe we do this because Cortex-M chips do not have a `4xint8` multiply-accumulate instruction, so in some sense we don't have fast `int8` operations. However, changing our operators to `int16` is substantially slower: while it saves a few sign-extension operations, it doubles the number of memory loads we need to perform.

This PR changes that behavior and ensures that Cortex-M microcontrollers do not have `int8` operations turned into `int16` ones. I have also verified that this does, in fact, improve performance on some common models. For example, MobileNet_v1_0.25 on the Cortex-M4 saw a 10% performance improvement compared to before this change. Accuracy does not seem to be affected.
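The memory-traffic argument can be illustrated with a rough back-of-envelope calculation. This is purely an illustration (the function and the layer shape below are my own, not TVM code): widening operands from `int8` to `int16` doubles the bytes streamed from memory for the same convolution, even though the arithmetic is unchanged.

```python
# Illustrative only: approximate bytes loaded by a conv2d at a given
# operand width, ignoring caching and tiling. Not a TVM API.

def conv2d_load_bytes(h, w, c_in, c_out, k, dtype_bytes):
    """Each output pixel reads a k*k*c_in activation patch;
    the weights are read once per layer (simplifying assumption)."""
    activations = h * w * k * k * c_in * dtype_bytes
    weights = k * k * c_in * c_out * dtype_bytes
    return activations + weights

# Hypothetical layer shape chosen for illustration.
int8_loads = conv2d_load_bytes(56, 56, 64, 64, 3, dtype_bytes=1)
int16_loads = conv2d_load_bytes(56, 56, 64, 64, 3, dtype_bytes=2)
print(int16_loads / int8_loads)  # 2.0
```

Since every term scales linearly with the operand width, the `int16` variant always loads exactly twice the bytes, which on load-bound Cortex-M kernels outweighs the sign extensions it saves.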