[ https://issues.apache.org/jira/browse/MADLIB-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659430#comment-16659430 ]
Frank McQuillan commented on MADLIB-1268:
-----------------------------------------

We transformed the MNIST dataset using affine transformations to get ~1.76 million rows, using the idea from https://www.cs.toronto.edu/~tijmen/affNIST/

!screenshot-2.png!

For 1, 4, 8, and 16 segments, we are seeing the expected accuracy curves. The catch with this data set is that it converges within a couple of iterations, so it is not the best one for demonstrating MPP gains.

Model:
{code}
# Simple CNN in Keras (TensorFlow backend): one convolutional layer,
# max pooling, dropout, then a softmax classifier.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from keras.optimizers import Adam

n_classes = 10  # 10 digit classes for the MNIST-derived data

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=(32, 32, 3,)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['categorical_accuracy'])
{code}
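For reference, here is a minimal sketch of that kind of affine expansion using Keras' ImageDataGenerator. The transformation ranges and the number of copies are illustrative assumptions, not the exact parameters we used:

{code}
# Sketch: expand MNIST with small random affine transformations (rotation,
# shear, shift), in the spirit of affNIST. Parameter ranges are illustrative.
import numpy as np
from keras.datasets import mnist
from keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0

datagen = ImageDataGenerator(rotation_range=20,      # degrees
                             shear_range=10,         # degrees
                             width_shift_range=0.1,  # fraction of width
                             height_shift_range=0.1) # fraction of height

# Each pass over the generator yields one randomly transformed copy of the
# 60k training images; ~29 passes lands near the ~1.76M rows mentioned
# above. In practice each pass would be written out (e.g. to a database
# table) rather than accumulated in memory.
n_passes = 29
augmented = []
for i, (batch_x, batch_y) in enumerate(
        datagen.flow(x_train, y_train, batch_size=len(x_train), shuffle=False)):
    augmented.append((batch_x, batch_y))
    if i + 1 >= n_passes:
        break
{code}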
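For the merge step itself, the standard MADlib weighted-average approach applied to Keras model weights could look roughly like the sketch below. The merge_weights helper and its two-model signature are hypothetical, not the final MADlib merge API:

{code}
# Sketch: weighted average of two segments' model weights, proportional to
# the number of rows each segment trained on. Hypothetical helper, not the
# final MADlib merge function; it operates on weights only, not gradients.
import numpy as np

def merge_weights(weights_a, n_rows_a, weights_b, n_rows_b):
    """Row-count-weighted average of two Keras weight lists."""
    total = float(n_rows_a + n_rows_b)
    return [(np.asarray(w_a) * n_rows_a + np.asarray(w_b) * n_rows_b) / total
            for w_a, w_b in zip(weights_a, weights_b)]

# Usage: model.get_weights() returns a list of numpy arrays, one per layer
# parameter, and set_weights() installs the merged result.
# merged = merge_weights(model_a.get_weights(), rows_a,
#                        model_b.get_weights(), rows_b)
# model_a.set_weights(merged)
{code}

Carrying the accumulated row count along with the weights keeps pairwise merges order-independent, which is what a segment-by-segment merge across the cluster needs.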
> Spike - CNN convergence, data parallel with merge
> -------------------------------------------------
>
> Key: MADLIB-1268
> URL: https://issues.apache.org/jira/browse/MADLIB-1268
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Deep Learning
> Reporter: Frank McQuillan
> Assignee: Frank McQuillan
> Priority: Major
> Fix For: v1.16
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Story
> `As a MADlib developer`
> I want to investigate convergence behavior when running a single distributed CNN
> model across the Greenplum cluster using Keras with a TensorFlow backend
> `so that`
> I can see if it converges in a predictable and expected way.
> Details
> * By "single distributed CNN model" I mean data parallel with merge (not
> model parallel) [6,7,8].
> * In defining the merge function, review [1] for the single-server, multi-GPU
> merge function, or use the standard MADlib weighted average approach.
> * For the dataset, consider MNIST and/or CIFAR-10. A bigger data set like Places
> http://places2.csail.mit.edu/ may also be useful.
> Acceptance
> 1) Plot characteristic curves of loss vs. iteration number. Compare with
> MADlib merge (this story) vs. without merge.
> 2) Define what the merge function is for a CNN. Is it the same as [1] or
> something else? Does it operate on weights only, or does it need gradients?
> 3) What does the architecture look like? Draw a diagram showing the sync/merge
> step for distributed model training.
> 4) What tests do we need to do to convince ourselves that the architecture is
> valid?
> 5) Do we need to write different merge functions, or have a different
> approach, for each different neural net type algorithm? Or is there a
> general approach that we can use that will apply to this class of algorithms?
> References
> [1] Check the “# Merge outputs under expected scope” section in the Python
> program
> https://github.com/keras-team/keras/blob/bf1378f39d02b7d0b53ece5458f9275ac8208046/keras/utils/multi_gpu_utils.py
> [2] Single Machine Data Parallel Multi-GPU Training
> https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/
> [3] Why are GPUs necessary for training Deep Learning models?
> https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
> [4] Deep Learning vs Classical Machine Learning
> https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa
> [5] TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
> Systems
> https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45166.pdf
> [6] Demystifying Parallel and Distributed Deep Learning: An In-Depth
> Concurrency Analysis
> https://arxiv.org/pdf/1802.09941.pdf
> * See section 7.4.2 for a discussion of model averaging.
> [7] Deep Learning with Elastic Averaging SGD
> https://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.pdf
> * Uses momentum and Nesterov methods in the model averaging computation.
> [8] Scalable Training of Deep Learning Machines by Incremental Block
> Training with Intra-block Parallel Optimization and Blockwise
> Model-Update Filtering
> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/0005880.pdf
> * Similar to [7]; uses momentum methods.