You can have a look at this paper on [sublinear memory usage](https://arxiv.org/pdf/1604.06174.pdf), which covers some common techniques DL frameworks use to lower GPU memory usage.
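MXNet exposes a related trade-off through the `MXNET_BACKWARD_DO_MIRROR` environment variable, which recomputes some forward activations during the backward pass instead of keeping them resident. A minimal sketch, assuming the variable should be set before MXNet is imported so the memory planner picks it up:

```python
import os

# Trade extra forward recomputation for lower peak GPU memory during
# training, in the spirit of the sublinear-memory paper above.
# Assumption: setting this before importing mxnet ensures the graph
# memory planner sees it.
os.environ["MXNET_BACKWARD_DO_MIRROR"] = "1"

import mxnet as mx
```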
If you only care about forward inference, you can try reducing the batch size to a small value (at the cost of speed), or quantizing the network (int8, float16, etc.). As far as I know, the MKL-DNN backend of MXNet supports quantization, and TensorRT also has good support for model quantization. There are some other methods you could try as well; see the sketch below for one of them. How does the GPU memory cost of the same model compare in TF or PyTorch?
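A minimal sketch of the float16 route with the Gluon API, combined with a small batch size. The model zoo `resnet50_v1` here is just a stand-in for your own network; any `gluon.Block` should cast the same way:

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)

# Stand-in model: replace with your own gluon.Block.
net = vision.resnet50_v1(pretrained=True, ctx=ctx)
net.cast('float16')  # roughly halves parameter/activation memory vs float32

# Small batch keeps peak activation memory low, at the cost of throughput.
x = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx, dtype='float16')
out = net(x)
mx.nd.waitall()  # block until the async computation finishes
```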
