[ https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1268:
------------------------------------

Description:

Story

`As a MADlib developer`
I want to investigate convergence behavior when running a single distributed CNN model across the Greenplum cluster, using Keras with a TensorFlow backend,
`so that`
I can see if it converges in a predictable and expected way.

Details

* By "single distributed CNN model" I mean data parallel with merge (not model parallel) [6,7,8].
* In defining the merge function, review [1] for the single-server, multi-GPU merge function, or use the standard MADlib weighted-average approach (a sketch follows the Acceptance list).
* For the dataset, consider MNIST and/or CIFAR-10. A bigger dataset such as Places (http://places2.csail.mit.edu/) may also be useful.

Acceptance

1) Plot characteristic curves of loss vs. iteration number. Compare with the MADlib merge (this story) vs. without the merge.
2) Define what the merge function is for a CNN. Is it the same as [1] or something else? Does it operate on weights only, or does it also need gradients?
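For Acceptance 2), here is a minimal sketch of what the weighted-average merge could look like if it operates on weights only. It assumes each segment reports its weight arrays (as returned by Keras's model.get_weights()) together with the number of rows it trained on; the function name merge_weights and its inputs are hypothetical, for illustration only, not an existing MADlib API.

{code:python}
# Hypothetical sketch: weighted average of per-segment model weights,
# weighted by the number of rows each segment trained on.
def merge_weights(weight_lists, shard_sizes):
    """weight_lists: one list of numpy arrays per segment, as returned by
    keras.models.Model.get_weights(); shard_sizes: rows seen per segment."""
    total = float(sum(shard_sizes))
    # zip(*weight_lists) pairs up corresponding layer arrays across segments.
    return [
        sum(w * (n / total) for w, n in zip(layer_weights, shard_sizes))
        for layer_weights in zip(*weight_lists)
    ]

# Usage: push the merged weights back into a model of the same architecture
# before the next pass over the data, e.g. model.set_weights(merged).
{code}

If the answer to Acceptance 2) turns out to be that gradients are needed, the same averaging would instead be applied to per-batch gradients before the optimizer step.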
References

[1] See the “# Merge outputs under expected scope” section in the Python program
https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
[2] Single Machine Data Parallel Multi-GPU Training
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
[3] Why are GPUs necessary for training Deep Learning models?
https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
[4] Deep Learning vs Classical Machine Learning
https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
[5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
[6] Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
* see section 7.4.2 for a discussion of model averaging
[7] Deep Learning with Elastic Averaging SGD
https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
* uses momentum and Nesterov methods in the model-averaging computation (a sketch of the elastic update follows this list)
[8] Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-Block Parallel Optimization and Blockwise Model-Update Filtering
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf
* similar to [7], uses momentum methods
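To make the notes on [7] concrete, here is a rough paraphrase of one synchronous elastic-averaging round from the EASGD paper (not MADlib code; the names and the alpha value are placeholders). Each worker takes a local SGD step plus an elastic pull toward a shared center variable, and the center is pulled toward the workers by the same elastic force.

{code:python}
# Rough sketch of one synchronous EASGD round, after [7]; placeholders only.
def easgd_round(workers, grads, center, lr=0.01, alpha=0.05):
    """workers: per-worker weight vectors (numpy arrays); grads: gradient of
    each worker's local loss at its current weights; center: center variable."""
    new_workers = [
        # Local SGD step plus an elastic pull toward the center variable.
        x - lr * g - alpha * (x - center)
        for x, g in zip(workers, grads)
    ]
    # The center variable moves toward the average of the workers.
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center
{code}

The momentum variant in [7] (EAMSGD) adds a Nesterov momentum term to the local step; [8] applies a similar idea blockwise with model-update filtering.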
was:

Story

`As a MADlib developer`
I want to investigate convergence behavior when running a single distributed CNN model across the Greenplum cluster, using Keras with a TensorFlow backend,
`so that`
I can see if it converges in a predictable and expected way.

Details

* By "single distributed CNN model" I mean data parallel with merge (not model parallel) [6,7,8].
* In defining the merge function, review [1] for the single-server, multi-GPU merge function, or use the standard MADlib weighted-average approach.
* For the dataset, consider MNIST and/or CIFAR-10. A bigger dataset such as Places (http://places2.csail.mit.edu/) may also be useful.

Acceptance

1) Plot characteristic curves of loss vs. iteration number. Compare with the MADlib merge (this story) vs. without the merge.
2) Define what the merge function is for a CNN. Is it the same as [1] or something else? Does it operate on weights only, or does it also need gradients?
3) What does the architecture look like? Draw a diagram showing the sync/merge step for distributed model training.
4) What tests do we need to do to convince ourselves that the architecture is valid?
5) Do we need to write different merge functions, or take a different approach, for each different type of neural net algorithm? Or is there a general approach that will apply to this whole class of algorithms?

References: [1]-[8], as listed above.

> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
>                 Key: MADLIB-1268
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1268
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Deep Learning
>            Reporter: Frank McQuillan
>            Assignee: Frank McQuillan
>            Priority: Major
>             Fix For: v1.16
>
>         Attachments: CIFAR-10 validation accuracy by iteration.png, screenshot-1.png, screenshot-2.png, training time vs accuracy for different cluster sizes.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)