[apache/incubator-mxnet] [RFC] Proposal to Graduate MXNet’s ONNX Support (#20063)

Zhaoqi Zhu Fri, 19 Mar 2021 21:44:40 -0700

### What is ONNX and Why

[ONNX](https://onnx.ai/), or Open Neural Network Exchange, is an open source
deep learning model format that acts as a framework-neutral graph
representation between DL frameworks or between training and inference. With
the ability to export models to the ONNX format, MXNet users can enjoy faster
inference and a wider range of deployment device choices, including edge and
mobile devices where MXNet installation may be constrained. Popular
hardware-accelerated and/or cross-platform ONNX runtime frameworks include
Nvidia [TensorRT](https://github.com/onnx/onnx-tensorrt), Microsoft
[ONNXRuntime](https://github.com/microsoft/onnxruntime), Apple
[CoreML](https://github.com/onnx/onnx-coreml), and
[TVM](https://tvm.apache.org/docs/tutorials/frontend/from_onnx.html).

There is a huge ecosystem revolving around ONNX. More than 1,000 projects are
built on top of ONNXRuntime according to this [GitHub dependency
page](https://github.com/microsoft/onnxruntime/network/dependents). It’s
crucial to make MXNet an active participant in this thriving community.

### The “Before” of ONNX Support in MXNet

The ONNX support for MXNet was first introduced in 2017-2018, with modules both
to export to and import from the ONNX format. However, we have not kept up with
the latest ONNX format since then, and the format has iterated through several
new [op sets](https://github.com/onnx/onnx/blob/master/docs/Operators.md).
While the MXNet community has always had demand, the current ONNX support
(shipped in MXNet 1.6, 1.7, 1.8) is outdated and supplies little: most MXNet
users use the [GluonCV](https://cv.gluon.ai/contents.html) and
[GluonNLP](https://nlp.gluon.ai/) toolkits to train or fine-tune models, or to
load pretrained models from the model zoo, but the majority of these models
cannot be exported by the current ONNX support.

(The current model support can be found
[here](https://cwiki.apache.org/confluence/display/MXNET/ONNX+Operator+Coverage),
at the bottom of the page.)

### The New Development on ONNX Support

Lately, we (Joe @josephevans, Wei @waytrue17, and I @Zha0q1) have been working
on restoring and improving ONNX support by supporting the export of the most
popular and state-of-the-art models. We are currently operating on the MXNet
v1.x branch, which is compatible with the latest few GluonCV/NLP releases that
most users have installed. So far, we have supported 90% of all the (180+)
GluonCV pretrained models, and 90% of the exports have also been verified to
produce outputs consistent with those of native MXNet. For GluonNLP, we have
added export support for RNN, BERT (and similar), GPT, and Transformer models.
We have also worked on highly requested features such as dynamic input shapes
and graph optimization. Our overall goal is a “train on Apache MXNet and deploy
anywhere” user experience that offers the most inference flexibility.

We plan to release the new ONNX support in the next v1.x (1.9) version.

### Proposal Summary

Now that the ONNX support is mature, we would like to graduate the ONNX support
from the `mx.contrib` namespace into an official and stable MXNet feature. We
believe this will help us best publicize the work and serve the needs of
interested users. Below is a summary of the graduation tasks, and each point is
explained in detail in the following paragraphs.

1. Graduate the MX2ONNX module (exporting to ONNX format) from the
`mx.contrib.onnx` namespace to a regular and shallower directory (such as
`mxnet.mx2onnx`).
2. Deprecate the ONNX2MX module (importing from ONNX format).
3. Only support ONNX 1.7, 1.8, and future releases for MX2ONNX.
4. Add a setup.py for MX2ONNX.
5. Provide up-to-date, accurate, and better documentation.

### Graduate MX2ONNX from contrib

Currently MX2ONNX is buried deep in the `mxnet.contrib` namespace. As the ONNX
export support is maturing and stabilizing quickly, we should graduate it from
the experimental “contrib” folder into a shallower namespace (such as
`mxnet.mx2onnx`). This way we can better promote it as an official feature, and
users will have more trust in it when trying it out. Also, we can add
readme/doc files to the new MX2ONNX directory explaining the ONNX
compatibility, APIs, operator support, model support, etc. This documentation
can be updated with each feature or bug-fix commit so that it always stays up
to date. We can easily point MXNet users interested in ONNX to this new
directory, or they can find it through a web search. For reference, [PyTorch’s
ONNX support](https://github.com/pytorch/pytorch/tree/master/torch/onnx) is in
`torch.onnx` and [tf2onnx](https://github.com/onnx/tensorflow-onnx) has its own
repository. Both of them are the first search result on Google while [MXNet’s
ONNX
support](https://github.com/apache/incubator-mxnet/tree/v1.x/python/mxnet/contrib/onnx/mx2onnx)
is not in the first 3 pages (keywords are [Framework name], ONNX, GitHub).

We should graduate MX2ONNX in the next v1.x (1.9) release. Because the
“contrib” namespace generally does not have any backward compatibility promise,
we propose to move the MX2ONNX files entirely to the new directory
`incubator-mxnet/python/mx2onnx`. We can keep a dummy Python file in the old
`mxnet.contrib.onnx.mx2onnx` directory with only the API definitions. When
users call the APIs through the old path, they will get an error stating that
the path is deprecated and what the new directory is. We can then remove this
dummy file in the release after that (1.10).
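
As a rough illustration of the dummy file described above, the sketch below keeps the old API name but raises an error that points to the new location. The error class, the helper name, the message text, the choice of `export_model` as the API, and `mxnet.mx2onnx` as the new path are all assumptions made for this example, not decisions from this proposal.

```python
# Hypothetical stub left behind at the old mxnet.contrib.onnx.mx2onnx path.
# All names and message wording here are assumptions for illustration.

OLD_MODULE_PATH = "mxnet.contrib.onnx.mx2onnx"
NEW_MODULE_PATH = "mxnet.mx2onnx"  # assumed; the final location is not settled


class MovedAPIError(ImportError):
    """Raised when an API is called through its retired import path."""


def _moved(api_name):
    """Build a placeholder that keeps the old API name but always raises."""
    def placeholder(*args, **kwargs):
        raise MovedAPIError(
            f"{OLD_MODULE_PATH}.{api_name} has been deprecated. "
            f"Use {NEW_MODULE_PATH}.{api_name} instead."
        )
    placeholder.__name__ = api_name
    return placeholder


# Only the API definitions remain in the old location.
export_model = _moved("export_model")
```

Calling `export_model(...)` through the old path would then fail immediately with a message naming the new module, rather than with a bare `ImportError` or `AttributeError`.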

### Deprecate the ONNX2MX Module

The ONNX2MX module (importing from ONNX format) was introduced in 2017, when
MXNet had more performant deployment solutions than other frameworks, a better
distributed training story, and support for more languages. However, this has
changed since then and importing from ONNX is no longer a requested use case.
As we are focusing on making exporting to ONNX for deployment a smooth
experience, we should deprecate the unrequested and under-invested ONNX2MX
module. The effort should include removing the Python files, test cases, and
relevant documentation and tutorials.

We propose to: 1) keep the ONNX2MX module in the next (1.9) release and add a
deprecation warning to the APIs; 2) remove all ONNX2MX related files, as
mentioned above, in the release after that (1.10).
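
One possible way to add the 1.9 deprecation warning is a small decorator built on Python's standard `warnings` module. This is only a sketch: the decorator name, the message wording, and the stand-in `import_model` body are assumptions for illustration.

```python
import functools
import warnings


def deprecated_onnx2mx(func):
    """Wrap an ONNX2MX API so that each call emits a DeprecationWarning."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{func.__name__} is part of ONNX2MX, which is deprecated in 1.9 "
            "and scheduled for removal in 1.10.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return func(*args, **kwargs)
    return wrapper


@deprecated_onnx2mx
def import_model(model_file):
    # Stand-in body; the real import logic would stay unchanged until removal.
    return f"loaded {model_file}"
```

The wrapped function keeps working as before, so existing user code continues to run in 1.9 while the warning gives advance notice of the removal.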

### Only Support ONNX 1.7 and Onward for MX2ONNX

ONNX generally releases twice a year: ONNX 1.7 was released in May 2020 and 1.8
in Nov 2020. With each new ONNX version, a new op set is released to either add
new operators or revise the configuration of existing ones. We propose to
support only 1.7 and onward to keep development focused (this way we do not
need to spend extra time implementing the same MXNet operator for earlier ONNX
op sets). This won’t be a blocker for model deployment, as the inference
frameworks generally add support for the latest ONNX very quickly. ONNXRuntime
always supports the latest ONNX right away, and the latest TensorRT currently
supports up to ONNX 1.7.

We plan to continue to support new ONNX versions after 1.8. Existing operator
and model tests can help validate the new implementations of the operators that
have an updated specification in the new op set. Users can choose to upgrade to
the new ONNX version, or they can stay at the current version if there is no
need to upgrade. ONNX runtime frameworks are generally backward compatible with
all previous op sets and the models generated based on them.
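
The minimum-version policy could be enforced with a small guard at export time. The following is a minimal sketch under assumptions: the function names and error message are hypothetical, and in practice the version string would presumably come from `onnx.__version__`.

```python
# Hypothetical guard for the proposed "ONNX 1.7 and onward" policy.
MIN_ONNX_VERSION = (1, 7)


def parse_release(version):
    """Return the (major, minor) pair of a version string such as '1.8.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def check_onnx_version(version):
    """Raise if the given ONNX version is older than the supported minimum."""
    if parse_release(version) < MIN_ONNX_VERSION:
        minimum = ".".join(str(p) for p in MIN_ONNX_VERSION)
        raise RuntimeError(
            f"ONNX {version} is not supported; please install ONNX >= {minimum}."
        )
```

Comparing integer tuples rather than raw strings matters here: a string comparison would rank "1.10" below "1.7", while `(1, 10) < (1, 7)` is correctly false.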

### Add a setup.py to MX2ONNX

We propose to create a setup.py so that users of earlier MXNet versions,
especially those who cannot easily upgrade to the newest MXNet, can also enjoy
the latest ONNX support by pulling the next release branch (v1.9) and doing a
pip install locally. After installing through setup.py, users should be able to
do `import mx2onnx` and make API calls through the mx2onnx name. Because
MX2ONNX only relies on MXNet for type and shape inference, this separate
installation should work with any MXNet version as long as the model itself is
compatible with that version.
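
A setup.py for this purpose might look roughly like the sketch below. Every value in it is an assumption for illustration: the package name follows the `import mx2onnx` usage above, the version is a placeholder, the `onnx>=1.7.0` pin reflects the proposed minimum, and MXNet is deliberately left out of the requirements so that whichever MXNet version the user already has stays in place.

```python
# setup.py (hypothetical sketch; all metadata values are placeholders)
import sys

SETUP_KWARGS = dict(
    name="mx2onnx",
    version="0.0.0",                   # placeholder, not a real release number
    description="Export MXNet models to the ONNX format",
    packages=["mx2onnx"],              # assumes the package sits next to setup.py
    install_requires=["onnx>=1.7.0"],  # mirrors the proposed minimum ONNX version
    python_requires=">=3.6",           # assumed lower bound
)

# Only hand control to setuptools when a command (e.g. `install`) is given,
# so the metadata above can also be imported and inspected on its own.
if __name__ == "__main__" and len(sys.argv) > 1:
    from setuptools import setup
    setup(**SETUP_KWARGS)
```

With such a file at the root of the MX2ONNX directory, a local install would be a `pip install .` run from that directory after checking out the release branch.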

On our website, we should instruct users to always try to pull from the
release branches (v1.9 and onward). Users should be told to use discretion
when pulling from development branches (v1.x) for the latest and unreleased
ONNX support.

### Better Documentation

We will need to remove the existing documentation and tutorials, as they are
outdated. New documents on the MXNet website should include:

1. MX2ONNX APIs
2. Tutorial on exporting to ONNX
3. Tutorial to get the ONNX model to work on ONNXRuntime and TensorRT

In the MX2ONNX directory, we will need to have readme files on:

1. Compatible ONNX, ONNXRuntime versions
2. MX2ONNX APIs
3. Operator support matrix
4. Gluon CV/NLP model zoo support matrix
5. Tutorial for pip installing MX2ONNX

### Forward-port the Same Changes to MXNet 2.0

At this time we are prioritizing the needs of current MXNet and GluonCV/NLP
users. Once the MXNet 2.0-compatible GluonCV/NLP releases are stabilized, we
will forward-port the same changes as proposed above.

