renganxu commented on issue #14047: mxnet.base.MXNetError: Cannot find argument 
'cudnn_algo_verbose'
URL: 
https://github.com/apache/incubator-mxnet/issues/14047#issuecomment-466573773
 
 
   Hi @ptrendx, since I had performance issues when running on bare metal, I 
also started using the NGC container ngc18.11_mxnet to run the MLPerf MXNet 
ResNet-50 benchmark on our servers, but it does not converge to the target 
accuracy. Each server has 4 V100-SXM2 32 GB GPUs. I ran the benchmark on two 
servers and set the parameters to the same values as DGX-1:
   ```
   --batch-size=208
   --kv-store=horovod
   --lr=0.6
   --warmup-epochs=5
   --dali-prefetch-queue=2
   --dali-nvjpeg-memory-padding=64
   ```
   Here the batch size is 208 per GPU, so the global batch size is 
208*8=1664, which is the same global batch size DGX-1 used in the published 
MLPerf result. However, the model does not reach the target accuracy of 74.9% 
even after 100 epochs: the evaluation accuracy is 74.35% after 100 epochs (see 
the following figure), whereas DGX-1 reached 75.22% after only 62 epochs. 
   
![image](https://user-images.githubusercontent.com/3160803/53275305-29895e80-36c0-11e9-88fa-6b7f89b366ab.png)
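
A minimal sketch of the batch-size arithmetic above, assuming one data-parallel worker per GPU. The helper name `global_batch_size` is hypothetical, and the single-node 8-GPU DGX-1 layout is an assumption about the published run rather than something stated in this thread:

```python
def global_batch_size(per_gpu_batch: int, gpus_per_node: int, nodes: int) -> int:
    """Effective batch size when every GPU runs one data-parallel worker."""
    return per_gpu_batch * gpus_per_node * nodes


# Two servers with 4 V100 GPUs each, as described above.
two_servers = global_batch_size(per_gpu_batch=208, gpus_per_node=4, nodes=2)

# One DGX-1 with 8 GPUs (assumed layout of the published run).
dgx1 = global_batch_size(per_gpu_batch=208, gpus_per_node=8, nodes=1)

print(two_servers, dgx1)  # 1664 1664
```

Both setups give the same global batch size, so a difference in convergence would have to come from something other than the batch size itself.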
   
   Could you give some guidance on how to choose the parameters so that the 
model converges, and converges faster? Thanks.
   
   
