[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179344795 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. Review comment: gives rise to the name Mixed precision: why capital M? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179344737 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. Review comment: resources required for training deep neural networks has -> resources required for training deep neural networks have This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179343748 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. + +In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy. + +## Prerequisites + +- Volta range of Nvidia GPUs +- Cuda 9 or higher +- CUDNN v7 or higher + +## Using the Gluon API + +With Gluon, we need to take care of two things to convert a model to support float16. +1. Cast the Gluon Block, so as to cast the parameters of layers and change the type of input expected, to float16. +2. Cast the data to float16 to match the input type expected by the blocks if necessary. + +### Training +Let us look at an example of training a Resnet50 model with the Caltech101 dataset with float16. +First, let us get some import stuff out of the way. + + +```python +import os +import tarfile +import multiprocessing +import time +import numpy as np +import mxnet as mx +from mxnet import nd, autograd, gluon +from mxnet.gluon.model_zoo import vision as models +from mxnet.metric import Accuracy +from mxnet.gluon.data.vision.datasets import ImageFolderDataset +``` + +Let us start by fetching the Caltech101 dataset and extracting it. + + +```python +url = "https://s3.us-east-2.amazonaws.com/mxnet-public/101_ObjectCategories.tar.gz; +dataset_name = "101_ObjectCategories" +data_folder = "data" +if not os.path.isdir(data_folder): +os.makedirs(data_folder) +tar_path = mx.gluon.utils.download(url, path='data') +if (not os.path.isdir(os.path.join(data_folder, "101_ObjectCategories")) or +not os.path.isdir(os.path.join(data_folder, "101_ObjectCategories_test"))): +tar = tarfile.open(tar_path, "r:gz") +tar.extractall(data_folder) +tar.close() +print('Data extracted') +training_path = os.path.join(data_folder, dataset_name) +testing_path = os.path.join(data_folder, "{}_test".format(dataset_name)) +``` + +Now we have the images in two folders, one for training and the other for test. Let us next create Gluon Dataset from these folders, and then create Gluon DataLoader from those datasets. Let us also define a transform function so that each image loaded is resized, cropped and transposed. + + +```python +EDGE = 224 +SIZE = (EDGE, EDGE) +NUM_WORKERS = multiprocessing.cpu_count() +# Lower batch size if you run out of memory on your GPU +BATCH_SIZE = 64 + +def transform(image, label): +resized = mx.image.resize_short(image, EDGE) +cropped, crop_info = mx.image.center_crop(resized, SIZE) +transposed = nd.transpose(cropped, (2,0,1)) +return transposed, label + +dataset_train =
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179345236 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. + +In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy. + +## Prerequisites + +- Volta range of Nvidia GPUs +- Cuda 9 or higher +- CUDNN v7 or higher + Review comment: Could you start with an overview that the tutorial covers both Gluon and Symbolic APIs? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179345318 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. + +In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy. + +## Prerequisites + +- Volta range of Nvidia GPUs +- Cuda 9 or higher +- CUDNN v7 or higher + +## Using the Gluon API + +With Gluon, we need to take care of two things to convert a model to support float16. +1. Cast the Gluon Block, so as to cast the parameters of layers and change the type of input expected, to float16. +2. Cast the data to float16 to match the input type expected by the blocks if necessary. + +### Training +Let us look at an example of training a Resnet50 model with the Caltech101 dataset with float16. +First, let us get some import stuff out of the way. + + +```python +import os +import tarfile +import multiprocessing +import time +import numpy as np +import mxnet as mx +from mxnet import nd, autograd, gluon +from mxnet.gluon.model_zoo import vision as models +from mxnet.metric import Accuracy +from mxnet.gluon.data.vision.datasets import ImageFolderDataset +``` + +Let us start by fetching the Caltech101 dataset and extracting it. + + +```python +url = "https://s3.us-east-2.amazonaws.com/mxnet-public/101_ObjectCategories.tar.gz; +dataset_name = "101_ObjectCategories" +data_folder = "data" +if not os.path.isdir(data_folder): +os.makedirs(data_folder) +tar_path = mx.gluon.utils.download(url, path='data') +if (not os.path.isdir(os.path.join(data_folder, "101_ObjectCategories")) or +not os.path.isdir(os.path.join(data_folder, "101_ObjectCategories_test"))): +tar = tarfile.open(tar_path, "r:gz") +tar.extractall(data_folder) +tar.close() +print('Data extracted') +training_path = os.path.join(data_folder, dataset_name) +testing_path = os.path.join(data_folder, "{}_test".format(dataset_name)) +``` + +Now we have the images in two folders, one for training and the other for test. Let us next create Gluon Dataset from these folders, and then create Gluon DataLoader from those datasets. Let us also define a transform function so that each image loaded is resized, cropped and transposed. + + +```python +EDGE = 224 +SIZE = (EDGE, EDGE) +NUM_WORKERS = multiprocessing.cpu_count() +# Lower batch size if you run out of memory on your GPU +BATCH_SIZE = 64 + +def transform(image, label): +resized = mx.image.resize_short(image, EDGE) +cropped, crop_info = mx.image.center_crop(resized, SIZE) +transposed = nd.transpose(cropped, (2,0,1)) +return transposed, label + +dataset_train =
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179345014 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. + +In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy. + +## Prerequisites + +- Volta range of Nvidia GPUs +- Cuda 9 or higher +- CUDNN v7 or higher + +## Using the Gluon API + +With Gluon, we need to take care of two things to convert a model to support float16. +1. Cast the Gluon Block, so as to cast the parameters of layers and change the type of input expected, to float16. +2. Cast the data to float16 to match the input type expected by the blocks if necessary. + +### Training +Let us look at an example of training a Resnet50 model with the Caltech101 dataset with float16. Review comment: Add a reference link to the dataset description? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179345103 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. + +In this tutorial we will walk through how one can train deep learning neural networks with mixed precision on supported hardware. We will first see how to use float16 and then some techniques on achieving good performance and accuracy. + +## Prerequisites + +- Volta range of Nvidia GPUs +- Cuda 9 or higher +- CUDNN v7 or higher + +## Using the Gluon API + +With Gluon, we need to take care of two things to convert a model to support float16. +1. Cast the Gluon Block, so as to cast the parameters of layers and change the type of input expected, to float16. +2. Cast the data to float16 to match the input type expected by the blocks if necessary. + +### Training +Let us look at an example of training a Resnet50 model with the Caltech101 dataset with float16. +First, let us get some import stuff out of the way. + + +```python +import os +import tarfile +import multiprocessing +import time +import numpy as np +import mxnet as mx +from mxnet import nd, autograd, gluon +from mxnet.gluon.model_zoo import vision as models +from mxnet.metric import Accuracy +from mxnet.gluon.data.vision.datasets import ImageFolderDataset +``` + +Let us start by fetching the Caltech101 dataset and extracting it. Review comment: Could you add a reminder of how big the dataset is (num images, number of GBs) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16
eric-haibin-lin commented on a change in pull request #10391: [MXNET-139] Tutorial for mixed precision training with float16 URL: https://github.com/apache/incubator-mxnet/pull/10391#discussion_r179344912 ## File path: docs/tutorials/python/float16.md ## @@ -0,0 +1,280 @@ +# Mixed precision training using float16 + +The computational resources required for training deep neural networks has been increasing of late because of complexity of the architectures and size of models. Mixed precision training allows us to reduces the resources required by using lower precision arithmetic. In this approach we train using 16 bit floating points (half precision) while using 32 bit floating points (single precision) for output buffers of float16 computation. This combination of single and half precision gives rise to the name Mixed precision. It allows us to achieve the same accuracy as training with single precision, while decreasing the required memory and training or inference time. + +The float16 data type, is a 16 bit floating point representation according to the IEEE 754 standard. It has a dynamic range where the precision can go from 0.000596046 (highest, for values closest to 0) to 32 (lowest, for values in the range 32768-65536). Despite the decreased precision when compared to single precision (float32), float16 computation can be much faster on supported hardware. The motivation for using float16 for deep learning comes from the idea that deep neural network architectures have natural resilience to errors due to backpropagation. Half precision is typically sufficient for training neural networks. This means that on hardware with specialized support for float16 computation we can greatly improve the speed of training and inference. This speedup results from faster matrix multiplication, saving on memory bandwidth and reduced communication costs. It also reduces the size of the model, allowing us to train larger models and use larger batch sizes. + +The Volta range of Graphics Processing Units (GPUs) from Nvidia have Tensor Cores which perform efficient float16 computation. A tensor core allows accumulation of half precision products into single or half precision outputs. For the rest of this tutorial we assume that we are working with Nvidia's Tensor Cores on a Volta GPU. Review comment: Put a reference link to Tensor Cores? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services