access2rohit opened a new issue #19585:
URL: https://github.com/apache/incubator-mxnet/issues/19585


   ## Problem statement
   I noticed that the memory pool keeps memory allocated to the MXNet process so that NDArrays/tensors can be allocated faster from the pool. At times the pool grows very large, and memory may not be released back to the pool immediately once an NDArray goes out of scope. When I ran the Large Tensor nightly tests (all at once, sequentially), I saw certain tests cause OOM even on a 720 GB RAM machine (p2.16xl), even though each of them individually used less than 50 GB. When I added a LOG(INFO) to check how much memory MXNet was requesting in bytes, it was asking for roughly 7500-8500 GB.
   Perhaps memory is not being released back to the pool after tensors go out of scope, or there could be an internal fragmentation issue in the pool itself. These are my observations from the test runs and from past experience reading through `pooled_storage_manager`. I will dive deeper into it and try to come up with a concrete suggestion.
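   To illustrate how internal fragmentation alone can inflate the bytes a pool requests, here is a minimal sketch (not MXNet's actual implementation) of a round-to-power-of-two pooled allocator, the same general strategy as the `Round` pool type: every request is rounded up to a bucket size, and freed blocks go back to the pool rather than to the OS.

   ```python
   # Illustrative sketch only -- NOT MXNet's pooled_storage_manager.
   # Shows how rounding requests up to power-of-two buckets can make a
   # pool hold far more memory than the tensors ever needed.

   def round_up_pow2(n):
       """Round n up to the next power of two."""
       p = 1
       while p < n:
           p <<= 1
       return p

   class RoundPool:
       def __init__(self):
           self.free = {}        # bucket size -> count of cached free blocks
           self.held_bytes = 0   # total bytes the pool has taken from the OS

       def alloc(self, nbytes):
           bucket = round_up_pow2(nbytes)
           if self.free.get(bucket, 0) > 0:
               self.free[bucket] -= 1      # reuse a cached block
           else:
               self.held_bytes += bucket   # take a new block from the OS
           return bucket

       def release(self, bucket):
           # Freed blocks return to the pool, not to the OS.
           self.free[bucket] = self.free.get(bucket, 0) + 1

   pool = RoundPool()
   # A 5 GB tensor request is rounded up to an 8 GB bucket:
   b = pool.alloc(5 * 1024**3)
   pool.release(b)
   print(pool.held_bytes // 1024**3)  # pool still holds 8 GB for a 5 GB need
   ```

   Under this scheme a workload that allocates many odd-sized large tensors can hold nearly 2x the memory it ever uses at once, which is consistent with (though does not by itself explain the full magnitude of) the inflated request sizes observed above.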
   
   ## Proposed solutions
   1. Make `MXNET_GPU_MEM_POOL_TYPE=Unpooled` also apply to the CPU context, i.e., use similar pooling strategies for CPU memory.
   2. Fix the fragmentation issue within the pool, if any.
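   For reference, the existing GPU-side switch is set via an environment variable; a CPU-side equivalent as proposed in (1) does not exist yet, so the second line below is hypothetical:

   ```shell
   # Existing: disable GPU memory pooling (valid values include Round, Naive, Unpooled).
   export MXNET_GPU_MEM_POOL_TYPE=Unpooled
   # Hypothetical CPU-side analogue proposed in this issue (name is illustrative only):
   # export MXNET_CPU_MEM_POOL_TYPE=Unpooled
   ```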


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


