QueensGambit edited a comment on issue #16173: Saving and loading cuDNN autotune and graph optimization
URL: https://github.com/apache/incubator-mxnet/issues/16173#issuecomment-537934625
 
 
   Regarding the export of a TensorRT executor handle (@Caenorst),
   the [ONNX-TensorRT repository](https://github.com/onnx/onnx-tensorrt) provides an executable to generate a TensorRT engine file from an ONNX model:
   
   ```
   onnx2trt my_model.onnx -o my_engine.trt
   ```
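   (Running `onnx2trt -h` should print the full list of available options, e.g. for setting the maximum batch size or workspace size.)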
   Alternatively, one can use the C++ API directly via these headers:
   
   ```
   NvOnnxParser.h
   NvOnnxParserTypedefs.h
   ```
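
   The same build-and-serialize step can also be scripted with the TensorRT Python API. Below is a minimal sketch, assuming a TensorRT 5/6-style API (`trt.OnnxParser`, `build_cuda_engine`); the file names are placeholders:

   ```python
   import tensorrt as trt

   TRT_LOGGER = trt.Logger(trt.Logger.INFO)

   # Parse the ONNX model, build an ICudaEngine and serialize it to disk
   with trt.Builder(TRT_LOGGER) as builder, \
           builder.create_network() as network, \
           trt.OnnxParser(network, TRT_LOGGER) as parser:
       builder.max_batch_size = 1            # implicit-batch mode
       builder.max_workspace_size = 1 << 28  # 256 MiB of tactic scratch space
       with open('my_model.onnx', 'rb') as f:
           if not parser.parse(f.read()):
               raise RuntimeError('Failed to parse the ONNX model')
       engine = builder.build_cuda_engine(network)
       with open('my_engine.trt', 'wb') as f:
           f.write(engine.serialize())
   ```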
   
   Later, the engine file can be reloaded from disk.
   Here is example Python code for this, using code fragments from
   https://github.com/onnx/onnx-tensorrt/issues/180 and
   https://github.com/NVIDIA/object-detection-tensorrt-example/blob/master/SSD_Model/utils/common.py.

   Unfortunately, I haven't found a C++ example for this yet:
   
   
   ```python
   import pycuda.autoinit
   import pycuda.driver as cuda
   import tensorrt as trt
   import numpy as np
   
   trt_engine_path = 'my_engine.trt'
   # initialize
   TRT_LOGGER = trt.Logger(trt.Logger.INFO)
   trt.init_libnvinfer_plugins(TRT_LOGGER, '')
   runtime = trt.Runtime(TRT_LOGGER)
   
   # https://github.com/onnx/onnx-tensorrt/issues/180
   def allocate_buffers(engine):
       """
       Allocates all buffers required for the specified engine
       """
       inputs = []
       outputs = []
       bindings = []
       # Iterate over binding names in engine
       for binding in engine:
           # Get binding (tensor/buffer) size
           size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
           # Get binding (tensor/buffer) data type (numpy-equivalent)
           dtype = trt.nptype(engine.get_binding_dtype(binding))
           # Allocate page-locked memory (i.e., pinned memory) buffers
           host_mem = cuda.pagelocked_empty(size, dtype)
           # Allocate linear piece of device memory
           device_mem = cuda.mem_alloc(host_mem.nbytes)
           # Append the device buffer to device bindings
           bindings.append(int(device_mem))
           # Append to inputs/outputs list
           if engine.binding_is_input(binding):
               inputs.append(HostDeviceMem(host_mem, device_mem))
           else:
               outputs.append(HostDeviceMem(host_mem, device_mem))
       # Create a stream (to eventually copy inputs/outputs and run inference)
       stream = cuda.Stream()
       return inputs, outputs, bindings, stream
   
   def infer(context, bindings, inputs, outputs, stream, batch_size=1):
       """
       Infer outputs on the IExecutionContext for the specified inputs
       """
       # Transfer input data to the GPU
       [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
       # Run inference
       context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
       # Transfer predictions back from the GPU
       [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
       # Synchronize the stream
       stream.synchronize()
       # Return the host outputs
       return [out.host for out in outputs]
   
   # https://github.com/NVIDIA/object-detection-tensorrt-example/blob/master/SSD_Model/utils/common.py
   # Simple helper data class that's a little nicer to use than a 2-tuple.
   class HostDeviceMem(object):
       def __init__(self, host_mem, device_mem):
           self.host = host_mem
           self.device = device_mem
   
       def __str__(self):
           return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
   
       def __repr__(self):
           return self.__str__()
   
   image = np.zeros((1, 3, 224, 224), dtype=np.float32)  # dummy data (dtype must match the engine input)
   
   # Read the serialized ICudaEngine from disk
   with open(trt_engine_path, 'rb') as f:
       # Deserialize the ICudaEngine using the runtime created above
       engine = runtime.deserialize_cuda_engine(f.read())
   # Now just as with the onnx2trt samples...
   # Create an IExecutionContext (context for executing inference)
   with engine.create_execution_context() as context:
       # Allocate memory for inputs/outputs
       inputs, outputs, bindings, stream = allocate_buffers(engine)
       # Copy the image into the page-locked input buffer (flattened)
       np.copyto(inputs[0].host, image.ravel())
       # Inference
       trt_outputs = infer(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
       # Prediction
       pred_id = np.argmax(trt_outputs[-1])
   ```
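
   As a quick sanity check that deserializing the engine really skips the expensive (re)build step, one can time repeated calls to `infer()`. A minimal sketch (to be run inside the same `with engine.create_execution_context()` block as above; the iteration count is arbitrary):

   ```python
   import time

   # Warm-up run so lazy CUDA initialization does not skew the timing
   infer(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

   start = time.perf_counter()
   for _ in range(100):
       infer(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
   elapsed = time.perf_counter() - start
   print('Average latency: %.3f ms' % (elapsed * 1000.0 / 100))
   ```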
