szhengac opened a new issue #19155: URL: https://github.com/apache/incubator-mxnet/issues/19155
When I pretrained a 3-layer BERT model using GluonNLP 0.10 on one p3.24dn instance with 32GB GPU memory, I received `CUDA: Check failed: e == cudaSuccess: misaligned address`. With a total batch size of 128, training uses 11GB of GPU memory and no error occurs. But when I slightly increased the total batch size to 176, or doubled it to 256, I received the error. I have cherry-picked https://github.com/apache/incubator-mxnet/pull/17767. @sxjscience you may want to try this setting in the numpy version.
