Hi Kellen,

Great to see some progress on this, as it is one of the major problems we face
right now. Your approach seems to be a good fit as a short-/mid-term solution.
Have you also considered using some form of signaling? As far as I understand
from your proposal and the example code, leveraging the 'can_read' attribute
requires busy waiting in the main thread. An approach similar to Unix signals,
where the caller registers a handler that gets invoked once an NDArray is
ready, could offer greater scalability, since no thread is tied up polling.
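
To make the idea concrete, here is a rough Python sketch of the
handler-registration pattern I have in mind. The names (ReadySignaller,
register_ready_callback, mark_ready) are purely illustrative and not part of
the MXNet API; presumably the real hook would live next to the existing
dependency tracking in the engine.

    import threading
    import queue

    # Illustrative only: instead of polling a 'can_read' flag, the engine
    # invokes a user-registered handler once the NDArray's pending writes
    # have completed, so the caller's thread can sleep until then.
    class ReadySignaller:
        def __init__(self):
            self._handlers = []

        def register_ready_callback(self, handler):
            # Analogous to installing a signal handler.
            self._handlers.append(handler)

        def mark_ready(self, ndarray):
            # Called by the (simulated) execution engine when the result
            # becomes readable.
            for handler in self._handlers:
                handler(ndarray)

    # Usage: the main thread blocks on a queue instead of busy waiting.
    results = queue.Queue()
    signaller = ReadySignaller()
    signaller.register_ready_callback(results.put)

    # A worker thread stands in for asynchronous inference completing.
    threading.Thread(target=signaller.mark_ready,
                     args=("output-ndarray",)).start()

    print(results.get())  # wakes up only once the handler has fired

With something like this the consuming thread sleeps on results.get() rather
than spinning on can_read, which should hold up better as the number of
concurrent requests grows.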

-Can

On Thu, May 10, 2018, at 10:42, kellen sunderland wrote:
> Hello MXNet developers,
> 
> 
> 
> I’ve recently been speaking with users who’d like to run parallel inference
> requests with MXNet on their service.  They’ll do this on GPUs, and due to
> resource constraints, they’d like to do this without duplicating their
> model’s weights in memory.  They’d also like to run inference with a low
> degree of buffering/batching as latency is important.  I’ve created a wiki
> page with a small proposal that I hope will make running parallel inference
> a little easier.  I’d like to discuss the proposal in this thread and would
> particularly appreciate it if core devs could correct me if I’ve made any
> incorrect assumptions in the doc.
> 
> 
> Proposal here:
> https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet
> 
> 
> 
> If people are OK with the proposal I can open a Jira ticket, PR, etc.  If
> people are curious about perf implications I can also do some benchmarking.
> 
> 
> 
> Thanks in advance for the feedback,
> 
> -Kellen
