Zha0q1 commented on issue #19918: URL: https://github.com/apache/incubator-mxnet/issues/19918#issuecomment-781855925
> We saw this error before. The problem happens when you create a new data loader while the previous data loader has not yet been fully destroyed (including the data it produced in shared memory). Workers in the new data loader inherit those shared-memory ndarrays without incrementing the usage counter that lives in the shared memory region itself, and once Python's garbage collector decides to destroy them, they decrement the usage counter, so it gets decremented too many times. Two things can happen then: either the workers destroy the ndarray and the main process gets this error, or the main process destroys it first, in which case the workers get this error and crash, which results in a hang.
>
> I made a small workaround for this in our container by inserting `waitall` and forcing Python's gc before the fork. I will make a PR tomorrow with this workaround.

That's great, thanks!
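For reference, here is a minimal sketch of what that kind of workaround might look like on the user side (the actual PR isn't shown here, and `make_dataloader` and its parameters are hypothetical names for illustration):

```python
import gc
import mxnet as mx
from mxnet.gluon.data import DataLoader

def make_dataloader(dataset, batch_size, num_workers, old_loader=None):
    """Hypothetical helper: create a new DataLoader only after the previous
    one (and the shared-memory ndarrays it produced) is fully destroyed."""
    if old_loader is not None:
        del old_loader
    # Wait for all pending asynchronous operations, so no shared-memory
    # ndarray is still in flight when the new workers fork.
    mx.nd.waitall()
    # Force Python's garbage collector to run before the fork, so the new
    # worker processes don't inherit stale shared-memory ndarrays whose
    # usage counters they would later decrement.
    gc.collect()
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
```

The `waitall` call matters because MXNet's engine is asynchronous: without it, shared-memory ndarrays from the old loader can still be referenced by pending operations at fork time.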
