masahi commented on PR #15772:
URL: https://github.com/apache/tvm/pull/15772#issuecomment-1733063297

   > I actually did not understand how we might be able to work with just "QDQ 
representation" with legalization. Would we be doing the work of both FQ2I and 
Canonicalize pass together in one shot to get an actual quantized 
implementation of the operator and schedule that primfunc?
   
   I think it depends on how a backend wants to process a quantized model. For 
CPU or GPU backends where "an actual quantized implementation" is not necessary 
(as long as they can see an (int8, int8) -> int32 GEMM in the lowered model, for 
example), the composition of FQ2I and Canonicalize might indeed be sufficient. 
For more advanced cases where one wants to minimize int8 <-> fp32 conversions, 
a backend would need to do something like the "QDQ propagation" done by TRT, 
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#tensorrt-process-qdq.
 Although TRT says it only supports the QDQ representation for quantized 
models, it clearly has an internal int8 representation for the result of such 
a propagation step.
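
   For concreteness, a minimal sketch (not from this PR) of what that FQ2I + 
Canonicalize composition could look like, assuming the stock Relay passes 
`relay.transform.FakeQuantizationToInteger` and 
`relay.qnn.transform.CanonicalizeOps`; the exact pass ordering a given backend 
wants may well differ:

```python
import tvm
from tvm import relay


def lower_qdq_module(mod: tvm.IRModule) -> tvm.IRModule:
    """Sketch: rewrite a QDQ-style Relay module into integer ops, then expand qnn ops."""
    seq = tvm.transform.Sequential(
        [
            # Collapse dequantize -> fp32 op -> quantize chains into qnn ops
            # that carry int8/int32 types directly (the FQ2I step).
            relay.transform.FakeQuantizationToInteger(),
            # Expand qnn ops (e.g. qnn.dense) into plain Relay arithmetic,
            # exposing the (int8, int8) -> int32 GEMM to the scheduler.
            relay.qnn.transform.CanonicalizeOps(),
            relay.transform.FoldConstant(),
        ]
    )
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```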

