Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi Rajan,

I wanted to share this on Confluence, but it didn't allow me to create a new
document. Once my e-mail address gets permission to add new Confluence
pages, I'll transfer the contents there. Please let me know when I have
edit permissions.

Thanks!

Marek



On Mon, Jun 11, 2018 at 11:02 AM singh.raja...@gmail.com <
singh.raja...@gmail.com> wrote:

> Hi Marek,
>
> Thanks for sharing the document. It would be great if you could share it
> on the Confluence wiki or in a Quip document. The formatting here makes a
> long document very difficult to read.
>
> Appreciate the help.
>
> Thanks
> Rajan
>

Re: Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi Marco,

Sorry about the formatting getting lost.

Here's the original Google doc. I originally wanted to use Confluence, but
I didn't have permissions to edit, so here goes.

https://docs.google.com/document/d/1UbsUacxWRKXCEE6v0r4VmKL76QLmFQYgMyAcQP0I8U0/edit?usp=sharing

Best,

Marek



On Mon, Jun 11, 2018 at 10:54 AM Marco de Abreu <
marco.g.ab...@googlemail.com> wrote:

> Hello Marek,
>
> This sounds great! Definitely looking forward to it.
>
> It seems like our mailing list destroyed your formatting. You might want
> to consider putting it into a Google Docs document or uploading it to
> Confluence.
>
> Best regards,
> Marco
>

Details regarding upcoming PR for runtime TensorRT integration

2018-06-11 Thread Marek Kolodziej
Hi everyone,

This is a quick summary of NVIDIA’s plans for open-sourcing an initial
integration of TensorRT as a runtime accelerator of MxNet (a PR for
discussion is coming in the next few days; the ETA for the first draft of
the PR is this Friday or even earlier). Feedback is appreciated.

Best,
Marek Kolodziej

Need for runtime MxNet-TensorRT integration

1. TensorRT provides significant acceleration of model inference on NVIDIA
GPUs compared to running the full graph in MxNet using unfused GPU
operators. In addition to faster fp32 inference, TensorRT optimizes fp16
inference, and is capable of int8 inference (provided the quantization
steps are performed). Besides increasing throughput, TensorRT
significantly reduces inference latency, especially for small batches. See
more here <https://developer.nvidia.com/tensorrt>.

2. Despite its benefits, using pre-trained models with TensorRT typically
requires some effort - either re-writing the model using TensorRT’s graph
building APIs, or exporting the model to ONNX, followed by an import step.
Even if the import is simplified using ONNX, the TensorRT user still needs
to provide their own data pipeline, which used to exist in the framework,
but no longer does in a stand-alone TensorRT deployment with a client
application.

3. TensorRT is very performant, but does not cover the full set of MxNet’s
operators. While that could be addressed with TensorRT plugins, it’s much
simpler to reuse already-existing MxNet operators. Also, the user
shouldn’t need to know which operators are supported by TensorRT and which
ones aren’t - runtime integration allows the graph partitioner to extract
subgraphs capable of running inside TensorRT, place each subgraph in a
TensorRT operator in MxNet, execute that operator as part of MxNet’s graph
execution, and handle the non-TensorRT-compatible nodes remaining after
subgraph extraction and node substitution as regular MxNet operators. The
goal is to accelerate inference without changing the user experience.
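
To make the partitioning idea concrete, here is a minimal illustrative
Python sketch. It is not the actual MxNet graph pass: the operator names
and the supported-operator set are placeholders, and it simplifies the
problem to grouping consecutive compatible nodes in a topological order,
whereas the real pass works on the graph structure itself.

# Illustrative sketch only - not the actual MxNet graph pass. The op names
# and the TRT_SUPPORTED set below are placeholders.
TRT_SUPPORTED = {'Convolution', 'BatchNorm', 'Activation', 'Pooling',
                 'FullyConnected', 'elemwise_add'}

def partition_for_trt(nodes):
    """Group consecutive TensorRT-compatible ops into subgraphs.

    `nodes` is a topologically ordered list of (name, op_type) pairs.
    Returns a list of segments: ('tensorrt', [nodes]) for a subgraph that
    would be wrapped in a single TensorRT operator, or ('mxnet', [node])
    for an operator left to MxNet's regular executor.
    """
    segments, current = [], []
    for name, op_type in nodes:
        if op_type in TRT_SUPPORTED:
            current.append((name, op_type))
            continue
        if current:                      # close the current TensorRT subgraph
            segments.append(('tensorrt', current))
            current = []
        segments.append(('mxnet', [(name, op_type)]))
    if current:
        segments.append(('tensorrt', current))
    return segments

# Example: one unsupported op splits the graph into two TensorRT subgraphs.
graph = [('conv0', 'Convolution'), ('bn0', 'BatchNorm'),
         ('relu0', 'Activation'), ('custom0', 'MyUnsupportedOp'),
         ('fc0', 'FullyConnected')]
for kind, segment in partition_for_trt(graph):
    print(kind, [name for name, _ in segment])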
Design considerations

1. Since TensorRT can only determine all possible optimizations once the
tensor shapes are known, it is imperative that all the shape information
be provided. This means that the best time to construct the TensorRT graph
is bind time. The coming PR can selectively apply the TensorRT
optimization for inference-only graphs at symbol bind time. This is in
fact consistent with the assumptions about TensorRT made on the MxNet Wiki
here
<https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+acceleration+libraries>.
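
As a concrete illustration of the shape information available at bind
time, here is a minimal sketch using the existing symbolic API. The toy
network, shapes, and CPU context are arbitrary examples; no TensorRT pass
is involved yet.

# Minimal sketch: with the symbolic API, complete input shapes are supplied
# at bind time, so every tensor shape in the graph can be inferred then.
import mxnet as mx

data = mx.sym.Variable('data')
net = mx.sym.Convolution(data, kernel=(3, 3), num_filter=16, name='conv0')
net = mx.sym.Activation(net, act_type='relu', name='relu0')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc0')

# simple_bind receives the full input shape; a bind-time graph pass (such
# as the proposed TensorRT pass) therefore has all shapes available.
executor = net.simple_bind(ctx=mx.cpu(), data=(1, 3, 224, 224),
                           grad_req='null')   # inference-only graph
print(executor.outputs[0].shape)              # (1, 10)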
2. Since, as mentioned in #1, TensorRT graph building needs shape
information that is only available at bind time, an important goal was not
to disrupt any existing APIs. Even though C++ permits default function
arguments, the Python bindings for symbol-related methods (e.g. simple
bind) are exposed via a C, not C++, API, wired up on the Python side using
ctypes (e.g. see here
<https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol/symbol.py#L1486:L1521>
for the simple bind integration). This precludes adding extra arguments
without causing breaking changes in the C API. Also, adapting the Python
code to such changes wouldn’t be enough, since all frontend languages use
the C (not C++) API for the FFI. Fortunately, C API changes could be
avoided by simply letting the user enable or disable the TensorRT pass
using an environment variable (USE_TENSORRT=1 to enable). This does not
diminish the flexibility of the integration, since the graph pass can read
the environment variable each time symbol binding is done, and hence the
graph passes can be turned on and off depending on need. The ability to
enable and disable the TensorRT pass at runtime also makes unit testing
easier.
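
Under this proposal, usage could look like the following sketch.
USE_TENSORRT is the variable name proposed above, the 'resnet-18'
checkpoint prefix is a placeholder, and the final integration may differ
in details.

# Sketch only: toggling the proposed TensorRT graph pass via the environment.
import os
import mxnet as mx

sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-18', 0)

# Enable the TensorRT pass: the next bind would partition the graph and
# build TensorRT engines for the compatible subgraphs.
os.environ['USE_TENSORRT'] = '1'
trt_exec = sym.simple_bind(ctx=mx.gpu(0), data=(1, 3, 224, 224),
                           grad_req='null')
trt_exec.copy_params_from(arg_params, aux_params)

# Disable the pass again: the next bind runs the unmodified MxNet graph,
# which makes it easy to compare the two executors' outputs in unit tests.
os.environ['USE_TENSORRT'] = '0'
ref_exec = sym.simple_bind(ctx=mx.gpu(0), data=(1, 3, 224, 224),
                           grad_req='null')
ref_exec.copy_params_from(arg_params, aux_params)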
3. TensorRT requires that the workspace size be provided at graph
construction time. This value constitutes the upper limit on the amount of
memory that TensorRT may use; it does not determine immediate use. Since
this amount can be hard for the user to know, the limit should be set to a
reasonable value that the user need not concern themselves with. Given
that TensorRT integration is applied at bind time and that the TensorRT
engines wrapped in TensorRT nodes are constructed during the graph pass
rather than the memory allocation pass, MxNet will only allocate the
amount needed for the nodes remaining after the TensorRT subgraphs have
been extracted. This means that no memory will be doubly allocated - first
for the complete MxNet subgraph and then for TensorRT. However, the
question remains whether the memory used per TensorRT engine should be a
configurable parameter, either as a method argument or an environment
variable, or whether TensorRT should be able to use the maximum available
GPU memory and then reserve only what it needs. I would like to suggest
the latter. Since the TensorRT subgraph will typically use less memory
than the same subgraph in MxNet (due to more layer fusion), it’s 

Proposal of MxNet-to-ONNX exporter

2018-05-21 Thread Marek Kolodziej
# Proposed API: mxnet.contrib.onnx._export.export_model

def export_model(sym, params, input_shape, output):
    """
    Exports an MXNet model (a symbol object or a path to a saved symbol
    file, plus its parameters) to an ONNX model file.

    Input Parameters
    ----------------
    sym :
        A str object (path to the symbol json file), an mxnet symbol
        object, or a checkpointed mxnet symbol object.
    params :
        A str object (path to the params file) or a dict object containing
        the model params.
    input_shape :
        A list of tuple objects, specifying the shape of each input to the
        model.
    output :
        Path to the output file, including the filename. Default: current
        path, filename: model.onnx

    Return Type
    -----------
    onnx_model_path :
        A str object, the path to the saved .onnx file.
    """
    ...
    return onnx_model_path
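
For illustration, usage of the proposed exporter could look like the sketch
below. The import path, file names, and input shape are assumptions based
on the signature above, not a finalized API.

# Hypothetical usage of the proposed exporter; names below are placeholders.
from mxnet.contrib.onnx._export import export_model  # assumed import path

sym_file = 'resnet-18-symbol.json'      # path to the saved symbol json
params_file = 'resnet-18-0000.params'   # path to the saved params
input_shapes = [(1, 3, 224, 224)]       # one shape tuple per model input

onnx_path = export_model(sym_file, params_file, input_shapes,
                         output='resnet-18.onnx')
print('ONNX model written to', onnx_path)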