Hello MXNet developers,
I’ve recently been speaking with users who’d like to run parallel inference requests with MXNet in their service. They’ll be doing this on GPUs and, due to resource constraints, they’d like to do it without duplicating their model’s weights in memory. They’d also like to run inference with a low degree of buffering/batching, as latency is important to them.

I’ve created a wiki page with a small proposal that I hope will make running parallel inference a little easier. I’d like to discuss the proposal in this thread, and I’d particularly appreciate it if core devs could correct me if I’ve made any incorrect assumptions in the doc.

Proposal here: https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

If people are OK with the proposal I can open a Jira ticket, PR, etc. If people are curious about the performance implications I can also do some benchmarking.

Thanks in advance for the feedback,

-Kellen
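
P.S. To make the use case a bit more concrete, below is a rough sketch of the kind of usage pattern I have in mind, using the Module API’s shared_module path so the worker executors reuse the main module’s memory instead of allocating their own copies of the weights. The checkpoint prefix ("resnet"), input shape, and number of workers are just placeholders, and whether running the forward passes from multiple threads like this is actually safe today is part of what the proposal needs to pin down, so please treat this as illustrative rather than a recommended recipe.

import threading

import mxnet as mx
import numpy as np

# Load the symbol and trained weights once (hypothetical "resnet" checkpoint).
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet', 0)
ctx = mx.gpu(0)
data_shapes = [('data', (1, 3, 224, 224))]

# The "main" module owns the parameter memory on the GPU.
main_mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
main_mod.bind(data_shapes=data_shapes, for_training=False)
main_mod.set_params(arg_params, aux_params)

# Additional modules are bound against the main one so that memory is
# shared rather than duplicated per worker.
worker_mods = []
for _ in range(3):
    m = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
    m.bind(data_shapes=data_shapes, for_training=False,
           shared_module=main_mod)
    worker_mods.append(m)

def serve(mod):
    # Each worker handles single-sample (unbatched) requests to keep latency low.
    batch = mx.io.DataBatch([mx.nd.array(np.random.rand(1, 3, 224, 224))])
    mod.forward(batch, is_train=False)
    print(mod.get_outputs()[0].shape)

threads = [threading.Thread(target=serve, args=(m,)) for m in worker_mods]
for t in threads:
    t.start()
for t in threads:
    t.join()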