masahi opened a new pull request #6782: URL: https://github.com/apache/incubator-tvm/pull/6782
This adds support for what PyTorch calls "dynamic quantization", where weights are quantized ahead of time but activations are quantized on the fly at runtime. See these for more details:

* https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
* https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
* https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html

TVM doesn't support such a quantization flow at the moment. This flow sits in a sweet spot in terms of ease of use and performance, so I think it is worth supporting. Here are the pros/cons compared to static quantization (the one we do support):

Pros:
* The API is trivial and quantization is automatic. There is no need to rewrite the model or do calibration (which is required for the other quantization workflows in PyTorch).
* Weights are quantized ahead of time, so the model size becomes much smaller. We can also use int8 math.

Cons:
* Scale and zero point calculation is done at runtime, so there is some overhead compared to the more standard static quantization.

My motivation for introducing this flow is to support quantized models from `transformers` such as BERT and GPT2, where dynamic quantization via PyTorch or ONNXRuntime is the only quantization path they support (from what I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration:

* https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
* https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb

This PR has the changes required to support the dynamic quantization flow via QNN:

* Support non-constant qparams in the QNN quantize and dense ops.
* Add a TorchScript `quantized::linear_dynamic` op converter to the PyTorch frontend (a usage sketch follows at the end of this description).

I prepared [a script](https://github.com/masahi/torchscript-to-tvm/tree/master/transformers) to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
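For reference, here is a minimal sketch of how a dynamically quantized model reaches the new converter. The `model`, `example_input`, and `input_ids` names and the shapes are placeholders for illustration only; `torch.quantization.quantize_dynamic`, `torch.jit.trace`, and `relay.frontend.from_pytorch` are the actual APIs involved.

```python
import torch
from tvm import relay

# Assumption: `model` is an eager-mode PyTorch module (e.g. a HuggingFace BERT)
# and `example_input` is a representative input tensor; both are placeholders.
model.eval()

# PyTorch dynamic quantization: weights of nn.Linear layers are quantized
# ahead of time, activation scale/zero point are computed at runtime.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Trace to TorchScript so the quantized::linear_dynamic ops become visible
# to the converter added in this PR.
traced = torch.jit.trace(qmodel, example_input)

# Import into Relay; the input name/shape (and dtype, if not float32) are
# illustrative and depend on the model.
mod, params = relay.frontend.from_pytorch(
    traced, [("input_ids", example_input.shape)]
)
```

In this flow the weight tensors arrive in the TorchScript graph as int8 constants, while the activation scale and zero point are Relay expressions computed from the input at runtime, which is why the QNN quantize and dense ops need to accept non-constant qparams.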