ptrendx commented on issue #17782:
URL: 
https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-732443750


   We encountered this issue when moving to Ubuntu 20.04 and Python 3.8. The 
issue does not happen during any single test, but running multiple dataworker 
tests in a row triggers it. The repro that we used for investigation was:
   ```
   nosetests --verbose -s tests/python/unittest/test_gluon_data.py:test_multi_worker{,_shape,_forked_data_loader,_dataloader_release_pool}
   ```
   The root cause seems to be how Python memory management interacts with 
forking. If some shared memory `NDArrays` are still alive (because they have 
not been garbage collected yet or are used by a running operation) when the 
new dataloader from the subsequent test is created, the child processes get 
copies of those `NDArrays` without the usage count being incremented, because 
the usage counter, like the actual data, lives in the shared memory region 
rather than in the memory space of the process calling `fork`. After the fork, 
when Python garbage collection kicks in, all of the processes (the parent that 
held the `NDArray` in the first place as well as its children) try to destroy 
it. Two scenarios can then happen:
    - the parent is "lucky" and destroys it first: the counter becomes 0 and 
the children then die because of this `CHECK` inside 
`cpu_shared_storage_manager.h`, which results in a hang of the dataloader as 
observed in #17774
    - the parent is "unlucky" and one of the children destroys the `NDArray` 
first: the parent hits the `CHECK` itself and dies with the error from this 
issue.
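
To make the two scenarios concrete, here is a toy model in plain Python. It does not use MXNet or a real `fork`: `SharedBlock`, `Handle` and `simulate_fork` are hypothetical names made up for this sketch, and the failure condition only approximates the idea of the `CHECK` (the exact condition tested in `cpu_shared_storage_manager.h` is not reproduced here).

```python
# Toy model only: SharedBlock stands in for a shared memory segment and
# Handle for an NDArray living in one process. Neither is an MXNet class.

class SharedBlock:
    """Data plus a usage counter that both live in 'shared memory'."""

    def __init__(self):
        self.usage_count = 0


class Handle:
    """Process-local object that refers to a SharedBlock."""

    def __init__(self, owner, block, increment=True):
        self.owner = owner
        self.block = block
        if increment:
            block.usage_count += 1

    def destroy(self):
        # Assumed stand-in for the CHECK: the counter must still be positive.
        if self.block.usage_count <= 0:
            raise RuntimeError("usage count is already zero")
        self.block.usage_count -= 1


def simulate_fork(parent_handle, num_children):
    # fork() duplicates the process-local handle, but nothing updates the
    # counter that is stored in the shared block.
    return [
        Handle("child%d" % i, parent_handle.block, increment=False)
        for i in range(num_children)
    ]


def destroy_all(handles):
    """Destroy handles in the given order; return the owners that failed."""
    failed = []
    for handle in handles:
        try:
            handle.destroy()
        except RuntimeError:
            failed.append(handle.owner)
    return failed


# Scenario 1: the parent destroys first, so every child fails.
block = SharedBlock()
parent = Handle("parent", block)
children = simulate_fork(parent, 2)
failed_when_parent_first = destroy_all([parent] + children)

# Scenario 2: one child destroys first, so the parent fails.
block2 = SharedBlock()
parent2 = Handle("parent", block2)
children2 = simulate_fork(parent2, 2)
failed_when_child_first = destroy_all([children2[0], parent2, children2[1]])
```

In both orderings only the first process to destroy its copy succeeds, since the counter was 1 for what are effectively three owners.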
   
   We intend to work around this issue for our upcoming release by inserting
   ```python
   mx.nd.waitall()
   import gc
   gc.collect()
   ```
   in the Gluon DataLoader constructor (which made the error disappear in our 
tests), but a more robust solution should be devised (maybe increment the usage 
counts of all the shared memory arrays during fork?).
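
One way the workaround could be packaged is as a small helper that runs just before the worker processes are created. This is only a sketch: `drain_before_fork` and its `wait_for_pending` parameter are hypothetical names, not part of MXNet. The MXNet call is passed in as a callable so the helper itself depends only on the standard library.

```python
import gc


def drain_before_fork(wait_for_pending=None):
    """Release lingering objects before worker processes are forked.

    wait_for_pending is an optional callable that blocks until queued
    operations finish (with MXNet, one would pass mx.nd.waitall). It runs
    first so that arrays held by running operations can be released, and
    only then is the garbage collector invoked.
    """
    if wait_for_pending is not None:
        wait_for_pending()
    # gc.collect() returns the number of unreachable objects it found.
    return gc.collect()


# Hypothetical usage at the top of a dataloader constructor:
#     import mxnet as mx
#     drain_before_fork(mx.nd.waitall)
```

Note that this only narrows the window: any shared memory array that is still legitimately referenced at fork time would be copied into the children in the same way.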




