masahi opened a new pull request #6782: URL: https://github.com/apache/incubator-tvm/pull/6782
This adds support for what PyTorch calls "dynamic quantization", where weights are quantized ahead of time but activations are quantized on the fly at runtime. See these for more details:

* https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
* https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
* https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html

TVM doesn't support such a quantization flow at the moment. This flow sits in a sweet spot in terms of ease of use and performance, so I think it is worth supporting. Here are the pros/cons compared to static quantization (the one we do support):

Pros:
* The API is trivial and quantization is automatic. There is no need to rewrite the model or do calibration (which is required for the other quantization workflows in PyTorch).
* Weights are quantized ahead of time, so the model size becomes much smaller. We can also use int8 math.

Cons:
* Scale and zero point calculation is done at runtime, so there is some overhead compared to the more standard static quantization.

My motivation for introducing this flow is to support quantized models from `transformers` such as BERT and GPT2, where dynamic quantization via PyTorch or ONNXRuntime is the only quantization path they support (from what I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration:

* https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
* https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb

This PR has the changes required to support the dynamic quantization flow via QNN:

* Support non-constant qparams in the QNN quantize and dense ops.
* Add a TorchScript `quantized::linear_dynamic` op converter to the PyTorch frontend (a usage sketch follows at the end of this description).

I prepared [a script](https://github.com/masahi/torchscript-to-tvm/tree/master/transformers) to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
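For reference, here is a minimal sketch of how a dynamically quantized model reaches the new converter. The `model`, `example_input`, and `input_ids` names and the shapes are placeholders for illustration only; `torch.quantization.quantize_dynamic`, `torch.jit.trace`, and `relay.frontend.from_pytorch` are the actual APIs involved.

```python
import torch
from tvm import relay

# Assumption: `model` is an eager-mode PyTorch module (e.g. a HuggingFace BERT)
# and `example_input` is a representative input tensor; both are placeholders.
model.eval()

# PyTorch dynamic quantization: weights of nn.Linear layers are quantized
# ahead of time, activation scale/zero point are computed at runtime.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Trace to TorchScript so the quantized::linear_dynamic ops become visible
# to the converter added in this PR.
traced = torch.jit.trace(qmodel, example_input)

# Import into Relay; the input name/shape (and dtype, if not float32) are
# illustrative and depend on the model.
mod, params = relay.frontend.from_pytorch(
    traced, [("input_ids", example_input.shape)]
)
```

In this flow the weight tensors arrive in the TorchScript graph as int8 constants, while the activation scale and zero point are Relay expressions computed from the input at runtime, which is why the QNN quantize and dense ops need to accept non-constant qparams.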