masahi commented on PR #15772: URL: https://github.com/apache/tvm/pull/15772#issuecomment-1733063297
> I actually did not understand how we might be able to work with just the "QDQ representation" with legalization. Would we be doing the work of both the FQ2I and Canonicalize passes together in one shot to get an actual quantized implementation of the operator, and then schedule that primfunc?

I think it depends on how a backend wants to process a quantized model. For CPU or GPU backends where "an actual quantized implementation" is not necessary (as long as they can see an (int8, int8) -> int32 GEMM in the lowered model, for example), the composition of FQ2I and Canonicalize might indeed be sufficient.

For more advanced cases where one wants to minimize int8 <-> fp32 conversions, a backend would need to do something like the "QDQ propagation" done by TRT: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#tensorrt-process-qdq. Although TRT says it only supports the QDQ representation for quantized models, it clearly has an internal int8 representation to hold the result of such a propagation step.
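To make the distinction concrete, here is a minimal NumPy sketch (not TVM's actual API; the function names and scales are hypothetical) of the two forms being discussed: the QDQ form, where the graph stays in fp32 with simulated quantize/dequantize pairs, and the canonicalized form, where a backend sees the (int8, int8) -> int32 GEMM directly with a single rescale at the end. An FQ2I-style rewrite is essentially the transformation from the first function to the second.

```python
import numpy as np

def quantize(x, scale, zp=0):
    # fp32 -> int8 affine quantization (symmetric when zp == 0)
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(np.int8)

def dequantize(q, scale, zp=0):
    # int8 -> fp32
    return (q.astype(np.int32) - zp).astype(np.float32) * scale

def qdq_gemm(a_fp32, b_fp32, sa, sb):
    # QDQ form: quantization is only *simulated*; the GEMM itself is fp32.
    a = dequantize(quantize(a_fp32, sa), sa)
    b = dequantize(quantize(b_fp32, sb), sb)
    return a @ b

def int8_gemm(a_fp32, b_fp32, sa, sb):
    # Canonicalized form: the integer GEMM a backend actually wants to see.
    qa = quantize(a_fp32, sa)
    qb = quantize(b_fp32, sb)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # (int8, int8) -> int32
    return acc.astype(np.float32) * (sa * sb)        # single rescale at the end

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 4)).astype(np.float32)
sa, sb = 0.05, 0.05
# The two forms are numerically equivalent up to fp32 rounding.
assert np.allclose(qdq_gemm(a, b, sa, sb), int8_gemm(a, b, sa, sb), atol=1e-3)
```

The "QDQ propagation" case is then about moving Q/DQ nodes across adjacent ops (e.g. pushing a dequantize past a ReLU so the ReLU runs on int8 data), so that longer chains stay in the integer domain instead of bouncing through fp32 between every pair of quantized ops.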