Hi Kellen,

Great to see some progress on this, as it is one of the major problems we face right now. Your approach seems to be a good fit for a short-/mid-term solution. Have you also considered using some sort of signaling? As far as I understand from your proposal and the example code, leveraging the 'can_read' attribute requires busy-waiting in the main thread. An approach similar to Unix signals, where the caller registers a handler that gets invoked when an NDArray is ready, could potentially offer greater scalability.
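To make the distinction concrete, here is a rough, self-contained Python sketch of the two patterns. Everything in it is hypothetical: the FakeAsyncResult mock, register_ready_callback, and the timer standing in for the engine are only illustrations, and can_read simply mirrors the attribute from your proposal rather than an existing MXNet API.

import threading
import time

# Stand-in for an asynchronously produced NDArray. 'can_read' mirrors the
# proposed attribute; 'register_ready_callback' is the hypothetical
# signal-style alternative.
class FakeAsyncResult:
    def __init__(self):
        self._ready = threading.Event()
        self._callbacks = []

    @property
    def can_read(self):
        return self._ready.is_set()

    def register_ready_callback(self, fn):
        # Fire immediately if the result is already available,
        # otherwise remember the handler for later.
        if self._ready.is_set():
            fn(self)
        else:
            self._callbacks.append(fn)

    def _complete(self, value):
        # Called by the "engine" thread once the forward pass is done.
        self.value = value
        self._ready.set()
        for fn in self._callbacks:
            fn(self)

def run_inference():
    result = FakeAsyncResult()
    # Pretend the engine finishes the forward pass 50 ms later.
    threading.Timer(0.05, result._complete, args=("prediction",)).start()
    return result

# 1) Busy-waiting on can_read: the caller's thread spins per request.
r = run_inference()
while not r.can_read:
    pass
print("polled:", r.value)

# 2) Signal-style: register a handler and keep the caller free to
#    dispatch further requests; the handler fires when the array is ready.
r2 = run_inference()
r2.register_ready_callback(lambda res: print("callback:", res.value))
time.sleep(0.1)  # keep the process alive long enough for the demo

The point is scalability: with the callback variant a single dispatcher thread can keep issuing requests while completions are delivered by the engine, instead of one thread spinning per outstanding NDArray.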
-Can

On Thu, May 10, 2018, at 10:42, kellen sunderland wrote:
> Hello MXNet developers,
>
> I’ve recently been speaking with users who’d like to run parallel inference
> requests with MXNet on their service. They’ll do this on GPUs, and due to
> resource constraints, they’d like to do this without duplicating their
> model’s weights in memory. They’d also like to run inference with a low
> degree of buffering/batching, as latency is important. I’ve created a wiki
> page with a small proposal that I hope will make running parallel inference
> a little easier. I’d like to discuss the proposal in this thread and would
> particularly appreciate it if core devs could correct me if I’ve made any
> incorrect assumptions in the doc.
>
> Proposal here:
> https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet
>
> If people are OK with the proposal I can open a Jira ticket, PR, etc. If
> people are curious about perf implications I can also do some benchmarking.
>
> Thanks in advance for the feedback,
>
> -Kellen