anirudh2290 commented on issue #16431: [RFC] MXNet Multithreaded Inference 
Interface
URL: 
https://github.com/apache/incubator-mxnet/issues/16431#issuecomment-562335146
 
 
   Thanks for the thoughtful and valuable comments @arcadiaphy.
   
   > I've deployed many models with the Scala API and run them in multiple threads. The whole system has run smoothly in a production environment for more than 2 months.
   
   > The inference backend is the graph executor, which is created for each thread with shared model parameters. The executors can be reshaped dynamically and independently in each thread according to the shape of the input data.
   
   Yes, if I am not mistaken, this is very similar to how the C Predict API supports multi-threaded inference today.
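   For reference, here is a rough sketch of that per-thread predictor pattern using the C Predict API (`mxnet/c_predict_api.h`). The symbol JSON and the parameter blob are loaded once and shared read-only, while each thread builds its own `PredictorHandle`. The model file names and the input shape are placeholders, and error checking is omitted.

```cpp
// Sketch: one predictor (graph executor) per thread, sharing the same
// symbol JSON and parameter bytes. File names and shapes are placeholders.
#include <mxnet/c_predict_api.h>
#include <fstream>
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

static std::string load_file(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::ostringstream ss;
  ss << in.rdbuf();
  return ss.str();
}

static void worker(const std::string& symbol_json, const std::string& param_blob) {
  const char* input_keys[] = {"data"};
  const mx_uint shape_indptr[] = {0, 4};
  const mx_uint shape_data[] = {1, 3, 224, 224};  // NCHW, batch size 1

  // Each thread creates its own executor from the shared model buffers.
  PredictorHandle pred = nullptr;
  MXPredCreate(symbol_json.c_str(),
               param_blob.data(), static_cast<int>(param_blob.size()),
               1 /* cpu */, 0 /* dev_id */,
               1, input_keys, shape_indptr, shape_data, &pred);

  std::vector<mx_float> input(1 * 3 * 224 * 224, 0.5f);
  MXPredSetInput(pred, "data", input.data(), static_cast<mx_uint>(input.size()));
  MXPredForward(pred);

  mx_uint* out_shape = nullptr;
  mx_uint out_dim = 0;
  MXPredGetOutputShape(pred, 0, &out_shape, &out_dim);
  mx_uint out_size = 1;
  for (mx_uint i = 0; i < out_dim; ++i) out_size *= out_shape[i];
  std::vector<mx_float> output(out_size);
  MXPredGetOutput(pred, 0, output.data(), out_size);

  MXPredFree(pred);
}

int main() {
  const std::string symbol_json = load_file("model-symbol.json");  // placeholder path
  const std::string param_blob  = load_file("model-0000.params");  // placeholder path
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(worker, std::cref(symbol_json), std::cref(param_blob));
  for (auto& t : threads) t.join();
  return 0;
}
```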
   
   > Like what's mentioned above, the dependency engine is not thread safe, so if you run it in the threaded engine, deadlocks and core dumps will happen. Therefore, the naive engine is the only option left. Without dependency scheduling, any write dependency on model parameters is likely to be executed concurrently and corrupt the internal data. If MKLDNN is used to accelerate inference, you will get non-deterministic results per inference because MXNet stealthily reorders the data in the NDArray (a write dependency) for MKLDNN operators. I've used a temporary method to address this issue which is not suitable for an official PR.
   
   This is a very useful point. In my proposal I was concentrating mostly on the ThreadedEngine, not the NaiveEngine. That said, I recently added tests for the NaiveEngine in my PR and everything seemed to work fine. So far I have not been able to reproduce the correctness issue you mention with MKLDNN (hidden write) and the NaiveEngine, but that could be because the Reorder doesn't happen in the spawned thread. Here is my test: https://github.com/apache/incubator-mxnet/pull/16654/files#diff-1335fbaf3930b1438d9be18edb07a1a6R1384. I am not sure whether something changed with MKLDNN 1.0 or my test simply doesn't catch that use case; I will dig more into this.
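   The gist of such a check, assuming the NaiveEngine is selected via the MXNET_ENGINE_TYPE environment variable, is to run the same input through the model from spawned threads and compare the outputs bit for bit. The sketch below only shows that structure; run_inference() is a placeholder standing in for the per-thread predictor code sketched earlier.

```cpp
// Sketch of a determinism check under the NaiveEngine. run_inference() is a
// placeholder; a real test would build a predictor (as in the sketch above),
// run a forward pass on a fixed input, and return the output buffer.
#include <cassert>
#include <cstdlib>
#include <thread>
#include <vector>

static std::vector<float> run_inference() {
  // Placeholder output; a real implementation would run the model here.
  return std::vector<float>(1000, 0.0f);
}

int main() {
  // Must be set before the first MXNet call (POSIX setenv shown here).
  setenv("MXNET_ENGINE_TYPE", "NaiveEngine", 1);

  std::vector<float> reference;
  std::thread([&] { reference = run_inference(); }).join();

  // A hidden MKLDNN reorder (write dependency) would show up as outputs
  // that differ between runs from spawned threads.
  for (int i = 0; i < 10; ++i) {
    std::vector<float> out;
    std::thread([&] { out = run_inference(); }).join();
    assert(out == reference && "non-deterministic output across runs");
  }
  return 0;
}
```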
   
   
   > Multithreaded inference should be used with caution. Sharing model parameters can reduce the memory footprint of your program, but a lot of memory is consumed by global resources (temporary workspace, random number generator, ...) or the MKLDNN op cache, which are stored in static thread_local variables. So the number of threads is the most important factor for memory footprint: any thread involving an MXNet operation, even a trivial imperative operator call, will incur memory overhead by creating its own set of thread_local variables. I've spent a lot of time tracking down memory leaks, and the best solution is to limit the number of threads.
   
   > A new method to do multithreaded inference with the threaded engine would be very welcome here. It will solve the above issues automatically and ensure result correctness by enforcing dependency checking.
   
   Yes, the earlier approach, which has one graph executor per thread, can consume a lot of memory for global resources. Sharing the cached op will alleviate the pain. As you know, we still have a lot of customers using the graph executor as the backend. It would be a great addition if you are interested in contributing towards making the graph executor thread safe for inference use cases as well.
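   To illustrate the per-thread cost being discussed, here is a toy example (not MXNet code) of the static thread_local pattern: each thread that touches the library gets its own copy of the resource, even for a single trivial call, so the memory footprint scales with the number of threads rather than the number of models.

```cpp
// Toy illustration (not MXNet code) of per-thread global resources held in
// thread_local storage: every thread that calls trivial_op() allocates its
// own workspace, so 8 threads means 8 copies of the scratch buffer.
#include <cstdio>
#include <thread>
#include <vector>

struct Workspace {
  std::vector<char> scratch;
  Workspace() : scratch(64 << 20) {}  // e.g. a 64 MB temporary workspace
};

// One copy per thread, constructed lazily on first use in that thread.
thread_local Workspace tls_workspace;

void trivial_op() {
  tls_workspace.scratch[0] = 1;  // even a trivial call instantiates the workspace
  std::printf("thread %zu allocated its own workspace\n",
              std::hash<std::thread::id>{}(std::this_thread::get_id()));
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) threads.emplace_back(trivial_op);
  for (auto& t : threads) t.join();
  return 0;
}
```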
