Thanks Przemek, I appreciate the input. Let me apply the loss scale to the
gradient clipping and run the experiment again.
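
Something like this is what I plan to try (untested sketch; context,
args.clip and the other variables come from the example script, and
amp.unscale / trainer._amp_loss_scaler are the two options you described):

with autograd.record():
    output, hidden = model(data, hidden)
    L = loss(output, target)
    L = L / (args.bptt * args.batch_size)
    with amp.scale_loss(L, trainer) as scaled_loss:
        mx.autograd.backward(scaled_loss)

grads = [p.grad(context) for p in model.collect_params().values()]
# Option 1: unscale the gradients first, then clip by the intended norm
amp.unscale(trainer)
gluon.utils.clip_global_norm(grads, args.clip)
# Option 2 (instead of option 1): leave the gradients scaled and scale the
# clipping threshold by the current loss scale
# gluon.utils.clip_global_norm(
#     grads, args.clip * trainer._amp_loss_scaler.loss_scale)
trainer.step(1)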

On Fri, May 1, 2020 at 11:20 AM Przemysław Trędak <ptre...@apache.org>
wrote:

> Just realized I did not actually link to the issue I mentioned; it is
> https://github.com/apache/incubator-mxnet/issues/17507
>
> On 2020/05/01 18:19:27, Przemysław Trędak <ptre...@apache.org> wrote:
> > Hi Naveen,
> >
> > The problem you see with the loss is due to the fact that the model
> > clips the gradients, which in the case of AMP are scaled by the loss
> > scale. For the clipping to work you need to apply the same loss scale to
> > the value you use to clip the gradients. This is currently possible in
> > two ways: either use the amp.unscale API to unscale the gradients before
> > clipping, or multiply your intended global norm of the gradients by
> > trainer._amp_loss_scaler.loss_scale (currently quite hacky; there is an
> > open issue [1] to expose it properly).
> >
> > Gradient clipping with AMP is a problem people commonly run into, and it
> > should be covered in the tutorial. I intend to update the tutorial with
> > an example of this, together with other changes intended to bring AMP
> > out of contrib.
> >
> > Regarding performance - it is quite hard to say what the reason is
> > without profiling the application; there could be multiple different
> > bottlenecks here, other than the actual computation on the GPU.
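> >
> > If it helps, one quick way to get such a profile is the built-in MXNet
> > profiler (just a sketch - the filename and the slice of iterations you
> > profile are up to you):
> >
> >     import mxnet as mx
> >     # record operators, memory and API calls into a JSON timeline
> >     mx.profiler.set_config(profile_all=True, aggregate_stats=True,
> >                            filename='amp_rnn_profile.json')
> >     mx.profiler.set_state('run')
> >     # ... run a few training iterations here ...
> >     mx.nd.waitall()               # wait so async GPU work is captured
> >     mx.profiler.set_state('stop')
> >     print(mx.profiler.dumps())    # aggregated per-operator statistics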
> >
> > Hope this helps :-)
> > Przemek
> >
> > On 2020/05/01 05:14:39, Naveen Swamy <mnnav...@gmail.com> wrote:
> > > Hello,
> > > I am trying to use AMP on an RNN model, but I am not seeing higher
> > > throughput with AMP, and the loss seems to have stagnated. I am
> > > wondering if I am missing something.
> > >
> > > Also, has AMP been tested on any RNN models, and are there any
> > > benchmarks? I would appreciate some input here.
> > >
> > > I used the RNN model here [1] and followed the tutorial in [2]; the
> > > output of the runs is:
> > > ----
> > > Without AMP:
> > > mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs 60 --dropout 0.65 --model gru --batch_size 128
> > >
> > > [Epoch 3 Batch 200/13] loss 6.47, ppl 648.24, throughput 675.94 samples/s
> > > [Epoch 3 Batch 400/13] loss 6.30, ppl 543.20, throughput 679.51 samples/s
> > > [Epoch 3] time cost 90.29s, valid loss 5.97, valid ppl 392.94
> > > test loss 5.89, test ppl 361.69
> > > [Epoch 4 Batch 200/13] loss 6.15, ppl 470.58, throughput 676.46 samples/s
> > > [Epoch 4 Batch 400/13] loss 6.01, ppl 408.21, throughput 679.51 samples/s
> > > [Epoch 4] time cost 90.27s, valid loss 5.69, valid ppl 296.89
> > > test loss 5.63, test ppl 277.58
> > > ----
> > > With AMP:
> > >
> > > (gluonnlp) ubuntu@ip-172-30-0-140:~/mxnet-lm$ python train.py --cuda --tied --nhid 1500 --emsize 1500 --epochs 60 --dropout 0.65 --model gru --batch_size 128 --amp True
> > > Namespace(amp=True, batch_size=128, bptt=35, clip=0.25, cuda=True,
> > > dropout=0.65, emsize=1500, epochs=60, export_model=False, gcthreshold=0.5,
> > > gctype='none', hybridize=False, log_interval=200, lr=20, model='gru',
> > > nhid=1500, nlayers=2, save='model.params', static_alloc=False,
> > > static_shape=False, tied=True)
> > > using AMP
> > > INFO:root:Using AMP
> > > [Epoch 3 Batch 200/13] loss 10.43, ppl 34026.18, throughput 685.66 samples/s
> > > [Epoch 3 Batch 400/13] loss 10.38, ppl 32150.51, throughput 688.99 samples/s
> > > [Epoch 3] time cost 89.04s, valid loss 10.36, valid ppl 31650.83
> > > test loss 10.36, test ppl 31626.99
> > > INFO:root:AMP: increasing loss scale to 131072.000000
> > > [Epoch 4 Batch 200/13] loss 10.42, ppl 33642.12, throughput 686.83 samples/s
> > > [Epoch 4 Batch 400/13] loss 10.37, ppl 31839.51, throughput 689.55 samples/s
> > > ----
> > >
> > > Changes made to the training loop after initializing amp and the trainer:
> > >
> > > with autograd.record():
> > >     output, hidden = model(data, hidden)
> > >     # Here L is a vector of size batch_size * bptt
> > >     L = loss(output, target)
> > >     L = L / (args.bptt * args.batch_size)
> > >     with amp.scale_loss(L, trainer) as scaled_loss:
> > >         mx.autograd.backward(scaled_loss)
> > >
> > > ----
> > > [1] https://github.com/apache/incubator-mxnet/blob/master/example/gluon/word_language_model/train.py
> > >
> > > [2] https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html
> > >
> > > Thanks, Naveen
> > >
> >
>
