Zha0q1 commented on issue #19918:
URL: https://github.com/apache/incubator-mxnet/issues/19918#issuecomment-781855925


   > We saw this error before. The problem happens when you create a new data
   > loader while the previous data loader has not yet been fully destroyed
   > (including the data it produced in shared memory). Workers in the new
   > data loader inherit those shared-memory ndarrays without increasing the
   > usage counter that lives in the shared memory region itself, and once
   > Python's garbage collector decides to destroy them, they decrement the
   > usage counter, so it gets decremented too many times. Two things can
   > happen then: either the workers destroy the ndarray and the main process
   > gets this error, or the main process destroys it first, and then the
   > workers get the error and crash, which results in a hang.
   > 
   > I made a small workaround for this in our container by inserting a
   > waitall and forcing Python GC before the fork. I will make a PR tomorrow
   > with this workaround.
   
   That's great, thanks!
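   For anyone hitting this in the meantime, here is a minimal sketch of the
   failure pattern and of the workaround described above. It assumes the
   `mxnet.gluon.data` Python API; the dataset, shapes, and worker count are
   made up for illustration.
   
   ```python
   import gc
   
   import mxnet as mx
   from mxnet.gluon.data import ArrayDataset, DataLoader
   
   # Hypothetical dataset; any dataset used with multi-worker loading applies.
   dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)))
   
   loader = DataLoader(dataset, batch_size=64, num_workers=4)
   for batch in loader:
       pass  # first epoch; batches are backed by shared memory
   
   # Workaround: make sure the old loader and the shared-memory ndarrays it
   # produced are really gone before the new loader forks its workers.
   del loader
   mx.nd.waitall()  # drain the async engine so nothing still touches the buffers
   gc.collect()     # collect now instead of leaving it to a later GC pass
   
   # Workers forked here no longer inherit stale shared-memory ndarrays, so
   # the usage counter is not decremented too many times.
   loader = DataLoader(dataset, batch_size=64, num_workers=4)
   for batch in loader:
       pass  # second epoch
   ```
   
   The explicit `gc.collect()` matters because CPython may otherwise keep the
   old loader's objects alive in reference cycles until some arbitrary later
   collection, i.e. possibly after the new workers have already been forked.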

