ptrendx commented on issue #19360:
URL: 
https://github.com/apache/incubator-mxnet/issues/19360#issuecomment-710129655


   We recently saw this issue too and I am looking for a fix now. I do not 
believe it is specific to CUDA 11; rather, it depends on code layout, timing, 
and environment: e.g. in our setup we did not see this issue on Ubuntu 18.04 
but do encounter it on 20.04. The problem is that MXNet does not actually wait for the side thread 
to finish before the program teardown. During the main thread teardown CUDA 
deinitializes itself. If the side thread is still running at this point and 
tries to destroy its mshadow stream, this calls `cudnnDestroy` on the cuDNN 
handle, which internally calls `cudaStreamDestroy` on cuDNN internal CUDA 
streams (CUDA is statically linked in cuDNN, which is why you see your segfault 
coming from `libcudnn_ops_infer.so.8`). When this call is made after CUDA has 
deinitialized, a crash happens.
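
   To make the ordering concrete, here is a minimal sketch that simulates the problem without CUDA or cuDNN. Everything in it is a hypothetical stand-in: `g_runtime_alive` plays the role of the CUDA runtime state, `FakeStream::Destroy` plays the role of destroying the mshadow stream (and with it the cuDNN handle), and `RunScenario` is an illustrative helper, not MXNet code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for "the CUDA runtime is still initialized".
std::atomic<bool> g_runtime_alive{true};

// Hypothetical stand-in for an mshadow stream that owns a cuDNN handle.
struct FakeStream {
  // Returns true if destruction happened while the runtime was alive (safe),
  // false if it happened after teardown (the case that would crash).
  bool Destroy() const { return g_runtime_alive.load(); }
};

// Runs one side thread that destroys its stream when asked to stop.
// join_before_teardown == true:  stop and join the worker, then tear down.
// join_before_teardown == false: tear down first, while the worker still runs.
bool RunScenario(bool join_before_teardown) {
  g_runtime_alive = true;
  std::atomic<bool> stop{false};
  bool destroyed_safely = false;

  std::thread worker([&] {
    FakeStream stream;
    while (!stop.load()) std::this_thread::yield();
    destroyed_safely = stream.Destroy();
  });

  if (join_before_teardown) {
    stop = true;
    worker.join();            // worker cleanup finishes first
    g_runtime_alive = false;  // only then does the runtime go away
  } else {
    g_runtime_alive = false;  // runtime goes away while the worker is alive
    stop = true;
    worker.join();            // joined only so the simulation exits cleanly
  }
  assert(!worker.joinable());
  return destroyed_safely;    // safe to read: join() synchronizes with the worker
}
```

   `RunScenario(true)` reports a safe destruction, while `RunScenario(false)` reports that the stream was destroyed after the simulated runtime was gone, which corresponds to the crash described above.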
   
   I started looking at this yesterday. A brief look at the destructors 
suggests that `join` should in fact be called on the side threads, so I am not 
yet sure why this does not do the right thing. If anyone has more 
experience with the internals of the `ThreadedEnginePerDevice` I would be happy 
to leave that issue to them, but I will keep poking at it in the meantime.
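
   For reference, this is the general pattern I would expect from a destructor that joins its side threads. It is a sketch only: `WorkerPool` and `CleanedUpAfterScope` are hypothetical names, not the actual `ThreadedEnginePerDevice` code, and the counter merely stands in for per-thread stream cleanup. Note that the join only happens if the destructor actually runs.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical owner of side threads; not MXNet's engine implementation.
class WorkerPool {
 public:
  WorkerPool(int num_workers, std::atomic<int>* cleaned_up) {
    for (int i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this, cleaned_up] {
        while (!stop_.load()) std::this_thread::yield();
        cleaned_up->fetch_add(1);  // stands in for destroying the thread's stream
      });
    }
  }

  // Signal every worker to exit, then wait for each one to finish its cleanup.
  ~WorkerPool() {
    stop_ = true;
    for (auto& t : workers_) {
      if (t.joinable()) t.join();
    }
  }

  WorkerPool(const WorkerPool&) = delete;
  WorkerPool& operator=(const WorkerPool&) = delete;

 private:
  std::atomic<bool> stop_{false};
  std::vector<std::thread> workers_;
};

// Returns how many workers had completed cleanup once the pool was destroyed.
int CleanedUpAfterScope(int num_workers) {
  std::atomic<int> cleaned_up{0};
  {
    WorkerPool pool(num_workers, &cleaned_up);
  }  // destructor runs here and joins all workers
  return cleaned_up.load();
}
```

   With this pattern, `CleanedUpAfterScope(3)` returns 3, because every worker has finished its cleanup by the time the destructor returns.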




