From: Moises Hernandez <[email protected]>
Date: Monday, November 16, 2020 at 7:51 AM
To: apachemxnetday <[email protected]>
Subject: Proposal/abstract "Serving 1 Million BERT inference requests for 20 
cents"

"Serving 1 Million BERT inference requests for 20 cents"

Attention-based models like BERT have revolutionized natural language 
processing (NLP) thanks to their ability to outperform traditional models on 
language tasks, as shown by their high scores on various NLP benchmarks. 
However, even smaller BERT models have more than 100 million parameters, 
making it difficult to achieve near-real-time inference speeds on generic 
compute hardware. GPUs generally outperform CPUs for BERT inference;
however, they also tend to cost more than CPU instances on AWS. Newer Tensor 
Core GPUs have proven more cost-effective and efficient for running 
inference workloads. In this talk, we will present a solution for performing 
inference on the popular BERT model in under 4 ms using NVIDIA T4 GPUs on an 
AWS EC2 G4 instance. We will cover specific optimizations to the model layers,
such as Softmax, bias addition, Gaussian Error Linear Units (GELU), and 
multi-head attention, which can significantly accelerate BERT inference. 
Our solution improves the performance of NLP tasks like
Question Answering, and classification tasks like Sentiment Analysis and Domain 
Classification. All of this work has been implemented in the Apache MXNet and 
GluonNLP frameworks and is available in the latest MXNet release. Lastly, we 
will cover how a user can leverage SageMaker to deploy the optimized BERT 
model and serve one million BERT inference requests for less than 20 cents 
on AWS.
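One of the layer-level optimizations mentioned above targets GELU, which is commonly accelerated at inference time by replacing the exact erf-based definition with a cheaper tanh approximation. A minimal sketch in plain Python (illustrative only, not the actual MXNet kernel; function names are ours):

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation often used in fast inference kernels;
    # avoids erf while staying very close to the exact value.
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
```

For typical activation magnitudes the two agree to within about 1e-3, which is why the approximation is a popular drop-in replacement.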
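As a rough back-of-envelope for the title claim: at sub-4 ms per request, one million sequential requests occupy roughly 1.1 GPU-hours. The hourly price below is a hypothetical placeholder, not a figure from the talk; check current AWS G4 pricing before relying on it:

```python
LATENCY_S = 0.004            # sub-4 ms per request, as claimed in the talk
REQUESTS = 1_000_000
ASSUMED_HOURLY_PRICE = 0.18  # hypothetical T4-backed instance price in USD/hr

gpu_hours = REQUESTS * LATENCY_S / 3600.0  # ~1.1 hours of GPU time
cost = gpu_hours * ASSUMED_HOURLY_PRICE    # ~0.20 USD under these assumptions
```

Under these assumptions the arithmetic lands at about 20 cents per million requests; batching and spot pricing would shift the numbers further in either direction.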
