ptrendx commented on issue #17782:
URL: https://github.com/apache/incubator-mxnet/issues/17782#issuecomment-732443750
We encountered this issue when moving to Ubuntu 20.04 and Python 3.8. It does not happen during any single test, but running multiple data-loader worker tests in a row triggers it. The repro we used for the investigation was:
```
nosetests --verbose -s
tests/python/unittest/test_gluon_data.py:test_multi_worker{,_shape,_forked_data_loader,_dataloader_release_pool}
```
The root cause seems to be how Python memory management interacts with forking. If some shared-memory `NDArray`s are still alive (because of pending garbage collection or a still-running operation) when the next test creates a new dataloader, the forked child processes receive copies of those `NDArray` handles without the usage count being incremented, since both the usage counter and the actual data live in the shared memory region rather than in the address space of the process calling `fork`. After the fork, once Python garbage collection kicks in, every process (the parent that actually owns the `NDArray` in the first place as well as its children) tries to destroy the same `NDArray`. Two scenarios can then happen (see the sketch after this list):
- the parent is "lucky" and destroys it first -> the counter drops to 0, and the children then die on the `CHECK` inside `cpu_shared_storage_manager.h`; this results in the dataloader hang observed in #17774
- the parent is "unlucky" and one of the children destroys the `NDArray` first -> the parent hits the `CHECK` itself and dies with the error reported in this issue.
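For illustration only, here is a toy Python sketch (not MXNet code) of the race described above: a reference count kept in shared memory is decremented by both the parent and the forked child, even though only the parent ever incremented it. The `multiprocessing.Value` counter is just a stand-in for the shared-memory storage manager's internal bookkeeping.
```python
# Toy sketch of the refcount race (not MXNet code): the "usage counter"
# lives in shared memory, so the parent and the forked child both
# decrement the same counter when their Python-level handles are destroyed.
import multiprocessing as mp
import os

def main():
    # Shared counter standing in for the NDArray's usage count; the parent
    # is the only process that ever incremented it (value starts at 1).
    refcount = mp.Value('i', 1)

    pid = os.fork()
    # The child inherited a copy of the Python handle, but no extra
    # increment was done on the shared counter.
    with refcount.get_lock():
        refcount.value -= 1  # both processes run this "destructor" path
        if refcount.value < 0:
            # Analogous to the condition the CHECK in
            # cpu_shared_storage_manager.h guards against.
            print(f"pid {os.getpid()}: usage count underflow!")

    if pid != 0:
        os.waitpid(pid, 0)

if __name__ == "__main__":
    main()
```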
We intend to work around this issue for our upcoming release by inserting
```python
mx.nd.waitall()
import gc
gc.collect()
```
in the Gluon DataLoader constructor (which made the error disappear in our tests), but a more robust solution should be devised (perhaps incrementing the usage count of every shared-memory array during fork?).
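For reference, the same workaround can also be applied on the user side without patching MXNet, by draining outstanding work and collecting garbage right before each DataLoader is constructed. This is only a sketch; `make_loader` is a hypothetical helper, not MXNet API:
```python
import gc
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

def make_loader(dataset, **kwargs):
    # Drain pending async work so no shared-memory NDArrays are still in
    # flight, then force a GC pass so stale NDArrays from a previous
    # DataLoader are freed before the worker processes are forked.
    mx.nd.waitall()
    gc.collect()
    return DataLoader(dataset, **kwargs)

dataset = ArrayDataset(mx.nd.arange(100).reshape(50, 2))
loader = make_loader(dataset, batch_size=10, num_workers=2)
for batch in loader:
    pass  # iterate normally; workers were forked from a "clean" state
```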