[ https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan updated MADLIB-1268:
------------------------------------

Description:

Story

`As a MADlib developer`
I want to investigate convergence behavior when running a single distributed CNN model across the Greenplum cluster, using Keras with a TensorFlow backend,
`so that`
I can see if it converges in a predictable and expected way.

Details

* By "single distributed CNN model" I mean data parallel with merge (not model parallel) [6,7,8].
* In defining the merge function, review [1] for the single-server, multi-GPU merge function, or use the standard MADlib weighted-average approach (a sketch follows the Acceptance list).
* For the dataset, consider MNIST and/or CIFAR-10. A bigger dataset such as Places (http://places2.csail.mit.edu/) may also be useful.

Acceptance

1) Plot characteristic curves of loss vs. iteration number. Compare with the MADlib merge (this story) vs. without the merge.
2) Define what the merge function is for a CNN. Is it the same as [1] or something else? Does it operate on weights only, or does it also need gradients?
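For Acceptance 2), here is a minimal sketch of what the weighted-average merge could look like if it operates on weights only. It assumes each segment reports its weight arrays (as returned by Keras's model.get_weights()) together with the number of rows it trained on; the function name merge_weights and its inputs are hypothetical, for illustration only, not an existing MADlib API.

{code:python}
# Hypothetical sketch: weighted average of per-segment model weights,
# weighted by the number of rows each segment trained on.
def merge_weights(weight_lists, shard_sizes):
    """weight_lists: one list of numpy arrays per segment, as returned by
    keras.models.Model.get_weights(); shard_sizes: rows seen per segment."""
    total = float(sum(shard_sizes))
    # zip(*weight_lists) pairs up corresponding layer arrays across segments.
    return [
        sum(w * (n / total) for w, n in zip(layer_weights, shard_sizes))
        for layer_weights in zip(*weight_lists)
    ]

# Usage: push the merged weights back into a model of the same architecture
# before the next pass over the data, e.g. model.set_weights(merged).
{code}

If the answer to Acceptance 2) turns out to be that gradients are needed, the same averaging would instead be applied to per-batch gradients before the optimizer step.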
References

[1] See the “# Merge outputs under expected scope” section in the Python program
https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
[2] Single Machine Data Parallel Multi-GPU Training
https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
[3] Why are GPUs necessary for training Deep Learning models?
https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
[4] Deep Learning vs Classical Machine Learning
https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
[5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
[6] Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
https://arxiv.org/pdf/1802.09941.pdf
* see section 7.4.2 for a discussion of model averaging
[7] Deep Learning with Elastic Averaging SGD
https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
* uses momentum and Nesterov methods in the model-averaging computation (a sketch of the elastic update follows this list)
[8] Scalable Training of Deep Learning Machines by Incremental Block Training with Intra-Block Parallel Optimization and Blockwise Model-Update Filtering
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf
* similar to [7], uses momentum methods
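To make the notes on [7] concrete, here is a rough paraphrase of one synchronous elastic-averaging round from the EASGD paper (not MADlib code; the names and the alpha value are placeholders). Each worker takes a local SGD step plus an elastic pull toward a shared center variable, and the center is pulled toward the workers by the same elastic force.

{code:python}
# Rough sketch of one synchronous EASGD round, after [7]; placeholders only.
def easgd_round(workers, grads, center, lr=0.01, alpha=0.05):
    """workers: per-worker weight vectors (numpy arrays); grads: gradient of
    each worker's local loss at its current weights; center: center variable."""
    new_workers = [
        # Local SGD step plus an elastic pull toward the center variable.
        x - lr * g - alpha * (x - center)
        for x, g in zip(workers, grads)
    ]
    # The center variable moves toward the average of the workers.
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center
{code}

The momentum variant in [7] (EAMSGD) adds a Nesterov momentum term to the local step; [8] applies a similar idea blockwise with model-update filtering.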
was:

Story

`As a MADlib developer`
I want to investigate convergence behavior when running a single distributed CNN model across the Greenplum cluster, using Keras with a TensorFlow backend,
`so that`
I can see if it converges in a predictable and expected way.

Details

* By "single distributed CNN model" I mean data parallel with merge (not model parallel) [6,7,8].
* In defining the merge function, review [1] for the single-server, multi-GPU merge function, or use the standard MADlib weighted-average approach.
* For the dataset, consider MNIST and/or CIFAR-10. A bigger dataset such as Places (http://places2.csail.mit.edu/) may also be useful.

Acceptance

1) Plot characteristic curves of loss vs. iteration number. Compare with the MADlib merge (this story) vs. without the merge.
2) Define what the merge function is for a CNN. Is it the same as [1] or something else? Does it operate on weights only, or does it also need gradients?
3) What does the architecture look like? Draw a diagram showing the sync/merge step for distributed model training.
4) What tests do we need to do to convince ourselves that the architecture is valid?
5) Do we need to write different merge functions, or take a different approach, for each different type of neural net algorithm? Or is there a general approach that will apply to this whole class of algorithms?

References: [1]-[8], as listed above.

> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
>                 Key: MADLIB-1268
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1268
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Deep Learning
>            Reporter: Frank McQuillan
>            Assignee: Frank McQuillan
>            Priority: Major
>             Fix For: v1.16
>
>         Attachments: CIFAR-10 validation accuracy by iteration.png, screenshot-1.png, screenshot-2.png, training time vs accuracy for different cluster sizes.png
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)