leezu commented on issue #17596: Fix transformer.cu interleaved matmul for cuda arch < 5
URL: https://github.com/apache/incubator-mxnet/pull/17596#issuecomment-586539281

Verified this patch by finetuning BERT on a P2 instance. Verification was initially blocked / delayed by https://github.com/apache/incubator-mxnet/pull/17576 ...
```
% python finetune_classifier.py --task_name RTE --batch_size 32 --epochs 3 --gpu 0 --lr 2e-5
INFO:root:01:21:10 Namespace(accumulate=None, batch_size=32, bert_dataset='book_corpus_wiki_en_uncased', bert_model='bert_12_768_12', calib_mode='customize', deploy=False, dev_batch_size=8, dtype='float32', early_stop=None, epochs=3, epsilon=1e-06, gpu=0, log_interval=10, lr=2e-05, max_len=128, model_parameters=None, model_prefix=None, num_calib_batches=5, only_calibration=False, only_inference=False, optimizer='bertadam', output_dir='./output_dir', pretrained_bert_parameters=None, quantized_dtype='auto', round_to=None, seed=2, task_name='RTE', training_steps=None, warmup_ratio=0.1)
[01:21:12] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
INFO:root:01:21:26 processing dataset...
INFO:root:01:21:35 Now we are doing BERT classification training on gpu(0)!
INFO:root:01:21:35 training steps=233
INFO:root:01:21:45 [Epoch 1 Batch 10/82] loss=0.7479, lr=0.0000078, metrics:accuracy:0.5507
INFO:root:01:21:54 [Epoch 1 Batch 20/82] loss=0.7263, lr=0.0000165, metrics:accuracy:0.5235
INFO:root:01:22:02 [Epoch 1 Batch 30/82] loss=0.6821, lr=0.0000194, metrics:accuracy:0.5306
INFO:root:01:22:12 [Epoch 1 Batch 40/82] loss=0.6718, lr=0.0000185, metrics:accuracy:0.5370
INFO:root:01:22:21 [Epoch 1 Batch 50/82] loss=0.6743, lr=0.0000175, metrics:accuracy:0.5518
INFO:root:01:22:31 [Epoch 1 Batch 60/82] loss=0.6894, lr=0.0000166, metrics:accuracy:0.5551
INFO:root:01:22:39 [Epoch 1 Batch 70/82] loss=0.6872, lr=0.0000156, metrics:accuracy:0.5587
INFO:root:01:22:48 [Epoch 1 Batch 80/82] loss=0.6626, lr=0.0000147, metrics:accuracy:0.5693
INFO:root:01:22:50 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:22:51 [Batch 10/35] loss=0.6449, metrics:accuracy:0.6750
INFO:root:01:22:52 [Batch 20/35] loss=0.6266, metrics:accuracy:0.6813
INFO:root:01:22:54 [Batch 30/35] loss=0.6930, metrics:accuracy:0.6625
INFO:root:01:22:54 validation metrics:accuracy:0.6715
INFO:root:01:22:54 Time cost=4.00s, throughput=69.97 samples/s
INFO:root:01:22:55 params saved in: ./output_dir/model_bert_RTE_0.params
INFO:root:01:22:55 Time cost=79.30s
INFO:root:01:23:03 [Epoch 2 Batch 10/82] loss=0.5310, lr=0.0000135, metrics:accuracy:0.7719
INFO:root:01:23:12 [Epoch 2 Batch 20/82] loss=0.5022, lr=0.0000126, metrics:accuracy:0.7650
INFO:root:01:23:22 [Epoch 2 Batch 30/82] loss=0.4835, lr=0.0000116, metrics:accuracy:0.7733
INFO:root:01:23:31 [Epoch 2 Batch 40/82] loss=0.4762, lr=0.0000107, metrics:accuracy:0.7754
INFO:root:01:23:40 [Epoch 2 Batch 50/82] loss=0.4412, lr=0.0000097, metrics:accuracy:0.7728
INFO:root:01:23:48 [Epoch 2 Batch 60/82] loss=0.4915, lr=0.0000088, metrics:accuracy:0.7741
INFO:root:01:23:57 [Epoch 2 Batch 70/82] loss=0.4512, lr=0.0000078, metrics:accuracy:0.7767
INFO:root:01:24:05 [Epoch 2 Batch 80/82] loss=0.3897, lr=0.0000069, metrics:accuracy:0.7832
INFO:root:01:24:06 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:24:08 [Batch 10/35] loss=0.6482, metrics:accuracy:0.7125
INFO:root:01:24:09 [Batch 20/35] loss=0.6311, metrics:accuracy:0.7125
INFO:root:01:24:10 [Batch 30/35] loss=0.7034, metrics:accuracy:0.7042
INFO:root:01:24:10 validation metrics:accuracy:0.7076
INFO:root:01:24:10 Time cost=4.00s, throughput=70.06 samples/s
INFO:root:01:24:11 params saved in: ./output_dir/model_bert_RTE_1.params
INFO:root:01:24:11 Time cost=76.11s
INFO:root:01:24:21 [Epoch 3 Batch 10/82] loss=0.2911, lr=0.0000057, metrics:accuracy:0.9125
INFO:root:01:24:30 [Epoch 3 Batch 20/82] loss=0.2762, lr=0.0000048, metrics:accuracy:0.9092
INFO:root:01:24:39 [Epoch 3 Batch 30/82] loss=0.2438, lr=0.0000038, metrics:accuracy:0.9121
INFO:root:01:24:47 [Epoch 3 Batch 40/82] loss=0.2719, lr=0.0000029, metrics:accuracy:0.9077
INFO:root:01:24:56 [Epoch 3 Batch 50/82] loss=0.2787, lr=0.0000019, metrics:accuracy:0.9054
INFO:root:01:25:05 [Epoch 3 Batch 60/82] loss=0.3279, lr=0.0000010, metrics:accuracy:0.9049
INFO:root:01:25:12 Finish training step: 233
INFO:root:01:25:12 Now we are doing evaluation on dev with gpu(0).
INFO:root:01:25:14 [Batch 10/35] loss=0.7463, metrics:accuracy:0.7125
INFO:root:01:25:15 [Batch 20/35] loss=0.6660, metrics:accuracy:0.7250
INFO:root:01:25:16 [Batch 30/35] loss=0.7802, metrics:accuracy:0.7125
INFO:root:01:25:16 validation metrics:accuracy:0.7112
INFO:root:01:25:16 Time cost=3.97s, throughput=70.60 samples/s
INFO:root:01:25:17 params saved in: ./output_dir/model_bert_RTE_2.params
INFO:root:01:25:17 Time cost=65.91s
INFO:root:01:25:17 Best model at epoch 2. Validation metrics:accuracy:0.7112
INFO:root:01:25:17 Now we are doing testing on test with gpu(0).
INFO:root:01:25:54 Time cost=36.38s, throughput=82.47 samples/s
```
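As a side note on reading the log: the `lr` column appears to follow a linear warmup followed by linear decay, which is the usual schedule for the `bertadam` optimizer with `warmup_ratio=0.1` over the 233 reported training steps. The sketch below is a reconstruction from the logged values, not code taken from this PR; the function name and the assumption that warmup steps are computed as `int(warmup_ratio * training_steps)` with 0-indexed steps are mine.

```python
# Reconstructed linear warmup + linear decay schedule (assumption: this is
# what produced the lr column above; not copied from the PR or the script).
def bert_finetune_lr(step, base_lr=2e-5, num_train_steps=233, warmup_ratio=0.1):
    """Learning rate at a given 0-indexed optimizer step."""
    num_warmup_steps = int(num_train_steps * warmup_ratio)  # 23 for this run
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps            # linear warmup
    offset = (step - num_warmup_steps) / (num_train_steps - num_warmup_steps)
    return base_lr * (1 - offset)                           # linear decay to 0

# Reproduce a few logged values (global step = cumulative batches - 1):
for step in (9, 29, 91):
    print(f"step {step}: lr={bert_finetune_lr(step):.7f}")
# step 9  -> 0.0000078  (Epoch 1 Batch 10)
# step 29 -> 0.0000194  (Epoch 1 Batch 30)
# step 91 -> 0.0000135  (Epoch 2 Batch 10)
```

Under these assumptions the reconstruction matches the warmup peak near batch 30 of epoch 1 and the subsequent decay to lr=0.0000010 by batch 60 of epoch 3.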
----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services