Good suggestion, Kellen! I like the idea; it would address an existing deficiency in MXNet that has so far only been worked around. As an example, the recently added Scala inference API (part of 1.2RC) implements a dispatcher in Scala to work around that limitation, roughly along the lines of the sketch below.
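For context, here is a minimal sketch of that kind of dispatcher workaround: all inference calls are funneled through a single thread so the non-thread-safe engine is never entered concurrently. The `Predictor` type and its `predict` method are placeholders I made up for illustration, not the actual Scala inference API.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Placeholder for an MXNet-backed model, which we assume is NOT safe
// to call from multiple threads at once.
class Predictor {
  def predict(input: Array[Float]): Array[Float] = input.map(_ * 2.0f) // dummy work
}

object Dispatcher {
  // Single-threaded executor: every inference request is serialized onto
  // one thread, so callers get concurrency at the API level while the
  // engine only ever sees sequential calls.
  private val inferenceEc =
    ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())

  private val predictor = new Predictor()

  // Callers on any thread submit work and get the result back as a Future.
  def infer(input: Array[Float]): Future[Array[Float]] =
    Future(predictor.predict(input))(inferenceEc)
}
```

The obvious downside, and presumably what the proposal aims to fix, is that requests are serialized behind one thread even when the hardware could service them in parallel.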
It would be great to better understand the changes you are planning in finer detail, though.

Hagay

On Thu, May 10, 2018 at 7:42 AM, kellen sunderland <kellen.sunderl...@gmail.com> wrote:

> Hello MXNet developers,
>
> I’ve recently been speaking with users who’d like to run parallel inference
> requests with MXNet on their service. They’ll do this on GPUs, and due to
> resource constraints, they’d like to do this without duplicating their
> model’s weights in memory. They’d also like to run inference with a low
> degree of buffering/batching, as latency is important. I’ve created a wiki
> page with a small proposal that I hope will make running parallel inference
> a little easier. I’d like to discuss the proposal in this thread and would
> particularly appreciate it if core devs could correct me if I’ve made any
> incorrect assumptions in the doc.
>
> Proposal here:
> https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet
>
> If people are OK with the proposal I can open a Jira ticket, PR, etc. If
> people are curious about perf implications I can also do some benchmarking.
>
> Thanks in advance for the feedback,
>
> -Kellen