MrRaghav commented on issue #18662:
URL: https://github.com/apache/incubator-mxnet/issues/18662#issuecomment-655415742


   Hello, thank you for your suggestion. I actually started working on machine translation only a few days ago and wanted to try all the possible scenarios before replying to you.
   Before contacting the developers, I referred to https://github.com/deepinsight/insightface/issues/257 and had already tried reducing the default batch size from 4096 to 2048, 1024, 512 and several other values (_chosen as multiples of the 2-3 GPUs I used to allot for the job_). In all of these cases, sockeye.train failed after 2-3 minutes of running.
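   
   For context, here is the rough arithmetic behind those batch-size choices, as a small sketch; the assumption that sockeye splits the total --batch-size evenly across the devices given via --device-ids is mine, not something I have verified in the docs:
   
        # Rough arithmetic behind the batch sizes I tried (assumption, not taken
        # from the sockeye docs: with --batch-type sentence the total --batch-size
        # is split evenly across devices, so it should be a multiple of the GPU count).
        for total_batch in (4096, 2048, 1024, 512, 200):
            for n_gpus in (2, 3, 5):
                print(f"--batch-size {total_batch:>4} on {n_gpus} GPUs "
                      f"-> {total_batch / n_gpus:.1f} sentences per GPU per step")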
   
   But yesterday I found one combination which 'seems' to have fixed the out-of-memory issue. Because of this, I haven't uninstalled the other versions of mxnet (_as you suggested_) for the time being.
   
   1) I tried with **5 GPUs** and **reduced the batch size to 200**
   2) The following parameters for **sockeye.train** worked okay: **--shared-vocab --num-embed 512 --batch-type sentence --batch-size 200 --num-layers 6:6 --transformer-model-size 512 --device-ids -5 --max-checkpoints 3**, and it ran for ~33 minutes.
       
   3) It didn't raise any memory issue, but it stopped with a new error:
        [ERROR:root] Uncaught exception
        Traceback (most recent call last):
          File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
            "__main__", mod_spec)
          File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
            exec(code, run_globals)
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 997, in <module>
            main()
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 764, in main
            train(args)
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/train.py", line 992, in train
            training_state = trainer.fit(train_iter=train_iter, validation_iter=eval_iter, checkpoint_decoder=cp_decoder)
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 264, in fit
            val_metrics = self._evaluate(self.state.checkpoint, validation_iter, checkpoint_decoder)
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/training.py", line 378, in _evaluate
            decoder_metrics = checkpoint_decoder.decode_and_evaluate(output_name=output_name)
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/checkpoint_decoder.py", line 176, in decode_and_evaluate
            references=self.target_sentences),
          File "/home/username/.local/lib/python3.7/site-packages/sockeye/evaluate.py", line 57, in raw_corpus_chrf
            return sacrebleu.corpus_chrf(hypotheses, references, order=sacrebleu.CHRF_ORDER, beta=sacrebleu.CHRF_BETA,
        **AttributeError: module 'sacrebleu' has no attribute 'CHRF_ORDER'**
        learning rate from ``lr_scheduler`` has been overwritten by ``learning_rate`` in optimizer.
   
   4) I have checked it, and it doesn't seem to be related to out-of-memory. However, a similar issue is mentioned under pytorch/fairseq: https://github.com/pytorch/fairseq/issues/2049.
   
   5) I have the following versions of sacrebleu, sockeye and mxnet (a quick check script is sketched just after this list):
       _sacrebleu           1.4.10
       sockeye             2.1.7
       mxnet               1.6.0
       mxnet-cu101mkl      1.6.0
       mxnet-mkl           1.6.0_
   
   6) I don't think opening random issues in every repository is a good idea, but I couldn't find any such issue or solution in the issues sections of sockeye, mxnet or sacrebleu.
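   
   Regarding points 3) and 5), below is a small check script (my own sketch, not part of sockeye or sacrebleu) that just prints the installed versions and whether the names sockeye's evaluate.py expects still exist in sacrebleu. My reading of the error above is that sacrebleu 1.4.10 no longer exposes these module-level constants, though I haven't confirmed that against sacrebleu's changelog:
   
        # My own check (hypothetical helper, not part of sockeye): print the
        # installed versions and confirm which sacrebleu names used by
        # sockeye/evaluate.py (corpus_chrf, CHRF_ORDER, CHRF_BETA) are missing.
        import pkg_resources  # available on Python 3.7
        import sacrebleu

        for pkg in ("sacrebleu", "sockeye", "mxnet"):
            print(pkg, pkg_resources.get_distribution(pkg).version)

        for name in ("corpus_chrf", "CHRF_ORDER", "CHRF_BETA"):
            print(f"sacrebleu.{name} present: {hasattr(sacrebleu, name)}")
   
   If CHRF_ORDER and CHRF_BETA really are missing in 1.4.10, I suppose pinning sacrebleu to an older release that still has them (or moving to a sockeye version that matches the newer sacrebleu API) would be the way out, but I would like your confirmation.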
   
   I would request you to spare a few minutes and let me know if I missed anything.



