waytrue17 commented on issue #20959:
URL: 
https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1079551643


   It looks like the memory leak in the above script is because we instantiate 
multiple dataloader objects in the for loop. Having one dataloader object seems 
to mitigate the issue:
   ```
   import mxnet.gluon as gl
   import mxnet as mx
   import gc
   
   if __name__ == "__main__":
       gpu_ctx = mx.gpu()
       model = gl.nn.Embedding(10, 5)
       model.initialize(ctx=gpu_ctx)
       X = mx.random.uniform(shape=(1000, 3))
       dataset = mx.gluon.data.dataset.ArrayDataset(X)
       num_workers = 8
       data_loader = gl.data.DataLoader(
                   dataset,
                   batch_size=1,
                   num_workers=num_workers,
               )
   
       for epoch in range(5):
           for batch in data_loader:
               # move data to gpu
               data_gpu = batch.copyto(mx.gpu())
               # forward
               l = model(data_gpu)
               # force immediate compute
               l.asnumpy()
   
           mx.nd.waitall()
   
           a, b = mx.context.gpu_memory_info(0)
           print(f"num_workers: {num_workers} epoch {epoch}: "
                 f"current gpu memory {(b - a) / (1024 * 1024 * 1024)} GB, "
                 f"Total gpu memory {b / (1024 * 1024 * 1024)} GB.")
           data_loader.refresh()
   ```
   
   ```
   num_workers: 8 epoch 0: current gpu memory 1.43017578125 GB, Total gpu 
memory 15.78192138671875 GB.
   num_workers: 8 epoch 1: current gpu memory 1.43017578125 GB, Total gpu 
memory 15.78192138671875 GB.
   num_workers: 8 epoch 2: current gpu memory 1.43017578125 GB, Total gpu 
memory 15.78192138671875 GB.
   num_workers: 8 epoch 3: current gpu memory 1.43017578125 GB, Total gpu 
memory 15.78192138671875 GB.
   num_workers: 8 epoch 4: current gpu memory 1.43017578125 GB, Total gpu 
memory 15.78192138671875 GB.
   ```
   Seems previously we had `mshadow::DeleteStream<gpu>(stream)` to clean up the 
GPU memory by the life cycle of dataloader object, but it had a [segfault 
issue](https://github.com/apache/incubator-mxnet/issues/19360). In the 
workaround [PR](https://github.com/apache/incubator-mxnet/pull/19378), we 
removed `mshadow::DeleteStream<gpu>(stream)` and relied on the OS to clean up 
memory at the end of the program. That may explain why we see memory leak when 
creating multiple dataloaders in the program.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to