Hello MXNet developers,
I’ve recently been speaking with users who’d like to run parallel inference requests with MXNet in their service. They’ll be doing this on GPUs and, due to resource constraints, they’d like to do it without duplicating their model’s weights in memory. They’d also like to run inference with a low degree of buffering/batching, as latency is important to them.

I’ve created a wiki page with a small proposal that I hope will make running parallel inference a little easier. I’d like to discuss the proposal in this thread, and I’d particularly appreciate it if core devs could correct me if I’ve made any incorrect assumptions in the doc.

Proposal here: https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

If people are OK with the proposal I can open a Jira ticket, PR, etc. If people are curious about the performance implications I can also do some benchmarking.

Thanks in advance for the feedback,

-Kellen
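
P.S. To make the use case a bit more concrete, below is a rough sketch of the kind of usage pattern I have in mind, using the Module API’s shared_module path so the worker executors reuse the main module’s memory instead of allocating their own copies of the weights. The checkpoint prefix ("resnet"), input shape, and number of workers are just placeholders, and whether running the forward passes from multiple threads like this is actually safe today is part of what the proposal needs to pin down, so please treat this as illustrative rather than a recommended recipe.

import threading

import mxnet as mx
import numpy as np

# Load the symbol and trained weights once (hypothetical "resnet" checkpoint).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet', 0)
ctx = mx.gpu(0)
data_shapes = [('data', (1, 3, 224, 224))]

# The "main" module owns the parameter memory on the GPU.
main_mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
main_mod.bind(data_shapes=data_shapes, for_training=False)
main_mod.set_params(arg_params, aux_params)

# Additional modules are bound against the main one so that memory is
# shared rather than duplicated per worker.
worker_mods = []
for _ in range(3):
    m = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
    m.bind(data_shapes=data_shapes, for_training=False,
           shared_module=main_mod)
    worker_mods.append(m)

def serve(mod):
    # Each worker handles single-sample (unbatched) requests to keep latency low.
    batch = mx.io.DataBatch([mx.nd.array(np.random.rand(1, 3, 224, 224))])
    mod.forward(batch, is_train=False)
    print(mod.get_outputs()[0].shape)

threads = [threading.Thread(target=serve, args=(m,)) for m in worker_mods]
for t in threads:
    t.start()
for t in threads:
    t.join()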