masahi opened a new pull request #6782:
URL: https://github.com/apache/incubator-tvm/pull/6782


   This adds support for what PyTorch calls "dynamic quantization", where 
weights are quantized ahead of time but activations are quantized on the fly at 
runtime. See more details in:
   
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
   https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
   
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
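
   For reference, here is what the flow looks like on the PyTorch side (a minimal sketch; the toy model and shapes are only for illustration):

   ```python
   import torch

   # Dynamic quantization converts the Linear weights to int8 ahead of time;
   # activation qparams are computed on the fly at runtime.
   model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
   qmodel = torch.quantization.quantize_dynamic(
       model, {torch.nn.Linear}, dtype=torch.qint8
   )

   # Tracing the quantized model yields quantized::linear_dynamic ops in the
   # TorchScript graph, which the converter in this PR handles.
   traced = torch.jit.trace(qmodel, torch.randn(1, 64))
   ```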
   
TVM doesn't support such a quantization flow at the moment. This flow hits a sweet spot between ease of use and performance, so I think it is worth supporting. Here are the pros and cons compared to static quantization (the one we do support):
   
   Pros:
   * The API is trivial and quantization is automatic, as the sketch above shows. No need to rewrite the model or do calibration (which the other quantization workflows in PyTorch require).
   * Weights are quantized ahead of time, so the model size becomes much smaller. We can also use int8 math.
   
   Cons:
   * Scale and zero point calculation happens at runtime, so there is some overhead compared to the more standard static quantization.
   
   My motivation for introducing this flow is to support quantized models from `transformers` such as BERT and GPT-2, for which dynamic quantization via PyTorch or ONNXRuntime is the only supported quantization path (as far as I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration.
   
   
https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
   
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb
   
   This PR has the changes required to support the dynamic quantization flow via QNN:
   * Support non-constant qparams in the QNN quantize and dense ops (see the sketch after this list).
   * Add a TorchScript `quantized::linear_dynamic` op converter in the PyTorch frontend.
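
   To illustrate the first point, here is a rough Relay sketch of qparams computed at runtime rather than baked in as constants (a simplified asymmetric int8 scheme for illustration, not the exact lowering this PR emits):

   ```python
   import tvm
   from tvm import relay

   def dynamic_quantize(data):
       # Derive qparams from the runtime min/max of the activation tensor,
       # mirroring what PyTorch's dynamic quantization does on the fly.
       min_val = relay.min(data)
       max_val = relay.max(data)
       scale = (max_val - min_val) / relay.const(255.0)
       zero_point = relay.cast(
           relay.round(relay.const(-128.0) - min_val / scale), "int32"
       )
       # scale and zero_point are now Relay expressions, not constants;
       # this PR lets qnn.quantize accept them.
       return relay.qnn.op.quantize(data, scale, zero_point, out_dtype="int8")

   data = relay.var("data", shape=(1, 64), dtype="float32")
   mod = tvm.IRModule.from_expr(relay.Function([data], dynamic_quantize(data)))
   ```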
   
   I prepared [a script](https://github.com/masahi/torchscript-to-tvm/tree/master/transformers) to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
   

