[ https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659430#comment-16659430 ]
Frank McQuillan commented on MADLIB-1268:
-----------------------------------------

We transformed the MNIST dataset using affine transformations to get ~1.76 million rows, using the idea from https://www.cs.toronto.edu/~tijmen/affNIST/

!screenshot-2.png!

For 1, 4, 8, and 16 segments, we are seeing the expected accuracy curves. The catch with this data set is that it converges within a couple of iterations, so it is not the best one for demonstrating MPP gains.

Model:
{code}
# Simple CNN in Keras (TensorFlow backend): one convolutional layer,
# max pooling, dropout, then a softmax classifier.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.optimizers import Adam

n_classes = 10  # 10 digit classes for the MNIST-derived data

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(32, 32, 3,)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['categorical_accuracy'])
{code}
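For reference, here is a minimal sketch of that kind of affine expansion using Keras' ImageDataGenerator. The transformation ranges and the number of copies are illustrative assumptions, not the exact parameters we used:

{code}
# Sketch: expand MNIST with small random affine transformations (rotation,
# shear, shift), in the spirit of affNIST. Parameter ranges are illustrative.
import numpy as np
from keras.datasets import mnist
from keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0

datagen = ImageDataGenerator(rotation_range=20,      # degrees
                             shear_range=10,         # degrees
                             width_shift_range=0.1,  # fraction of width
                             height_shift_range=0.1) # fraction of height

# Each pass over the generator yields one randomly transformed copy of the
# 60k training images; ~29 passes lands near the ~1.76M rows mentioned
# above. In practice each pass would be written out (e.g. to a database
# table) rather than accumulated in memory.
n_passes = 29
augmented = []
for i, (batch_x, batch_y) in enumerate(
        datagen.flow(x_train, y_train, batch_size=len(x_train), shuffle=False)):
    augmented.append((batch_x, batch_y))
    if i + 1 >= n_passes:
        break
{code}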
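For the merge step itself, the standard MADlib weighted-average approach applied to Keras model weights could look roughly like the sketch below. The merge_weights helper and its two-model signature are hypothetical, not the final MADlib merge API:

{code}
# Sketch: weighted average of two segments' model weights, proportional to
# the number of rows each segment trained on. Hypothetical helper, not the
# final MADlib merge function; it operates on weights only, not gradients.
import numpy as np

def merge_weights(weights_a, n_rows_a, weights_b, n_rows_b):
    """Row-count-weighted average of two Keras weight lists."""
    total = float(n_rows_a + n_rows_b)
    return [(np.asarray(w_a) * n_rows_a + np.asarray(w_b) * n_rows_b) / total
            for w_a, w_b in zip(weights_a, weights_b)]

# Usage: model.get_weights() returns a list of numpy arrays, one per layer
# parameter, and set_weights() installs the merged result.
# merged = merge_weights(model_a.get_weights(), rows_a,
#                        model_b.get_weights(), rows_b)
# model_a.set_weights(merged)
{code}

Carrying the accumulated row count along with the weights keeps pairwise merges order-independent, which is what a segment-by-segment merge across the cluster needs.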
> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
> Key: MADLIB-1268
> URL: https://issues.apache.org/jira/browse/MADLIB-1268
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Deep Learning
> Reporter: Frank McQuillan
> Assignee: Frank McQuillan
> Priority: Major
> Fix For: v1.16
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Story
> `As a MADlib developer`
> I want to investigate convergence behavior when running a single distributed CNN
> model across the Greenplum cluster using Keras with a TensorFlow backend
> `so that`
> I can see if it converges in a predictable and expected way.
> Details
> * By "single distributed CNN model" I mean data parallel with merge (not
> model parallel) [6,7,8].
> * In defining the merge function, review [1] for the single-server, multi-GPU
> merge function, or use the standard MADlib weighted average approach.
> * For the dataset, consider MNIST and/or CIFAR-10. A bigger data set like Places
> http://places2.csail.mit.edu/ may also be useful.
> Acceptance
> 1) Plot characteristic curves of loss vs. iteration number. Compare with
> MADlib merge (this story) vs. without merge.
> 2) Define what the merge function is for a CNN. Is it the same as [1] or
> something else? Does it operate on weights only, or does it need gradients?
> 3) What does the architecture look like? Draw a diagram showing the sync/merge
> step for distributed model training.
> 4) What tests do we need to do to convince ourselves that the architecture is
> valid?
> 5) Do we need to write different merge functions, or have a different
> approach, for each different neural net type algorithm? Or is there a
> general approach that we can use that will apply to this class of algorithms?
> References
> [1] Check the “# Merge outputs under expected scope” section in the Python
> program
> https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
> [2] Single Machine Data Parallel Multi-GPU Training
> https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
> [3] Why are GPUs necessary for training Deep Learning models?
> https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
> [4] Deep Learning vs Classical Machine Learning
> https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
> [5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
> Systems
> https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
> [6] Demystifying Parallel and Distributed Deep Learning: An In-Depth
> Concurrency Analysis
> https://arxiv.org/pdf/1802.09941.pdf
> * See section 7.4.2 for a discussion of model averaging.
> [7] Deep Learning with Elastic Averaging SGD
> https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
> * Uses momentum and Nesterov methods in the model averaging computation.
> [8] Scalable Training of Deep Learning Machines by Incremental Block
> Training with Intra-block Parallel Optimization and Blockwise
> Model-Update Filtering
> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf
> * Similar to [7]; uses momentum methods.