[ https://issues.apache.org/jira/browse/MADLIB-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490060#comment-16490060 ]
ASF GitHub Bot commented on MADLIB-1210:
----------------------------------------
GitHub user kaknikhil opened a pull request:
https://github.com/apache/madlib/pull/272
MLP: Add momentum and nesterov to gradient updates.
JIRA: MADLIB-1210
We refactored the minibatch code to separate out the momentum and model
update functions. Initially we used the same function to compute the loss
and gradient for both igd and minibatch, but the overhead of creating and
updating the total_gradient_per_layer variable made igd slower. We therefore
stopped sharing that code and now call the model and momentum update
functions separately for both igd and minibatch.
Co-authored-by: Rahul Iyer <[email protected]>
Co-authored-by: Jingyi Mei <[email protected]>
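A minimal sketch of the split described above (illustrative Python only; the
helper names, and the representation of coeffs, velocity, and gradients as
lists of per-layer numpy arrays, are assumptions rather than MADlib's actual
internals):
{code}
import numpy as np

def momentum_update(velocity, gradient, lr, momentum):
    # Momentum update, shared by igd and minibatch: decay the previous
    # per-layer velocities and subtract the scaled per-layer gradients.
    return [momentum * v - lr * g for v, g in zip(velocity, gradient)]

def model_update(coeffs, velocity):
    # Model update, shared by igd and minibatch: step each layer's
    # coefficients by its velocity.
    return [c + v for c, v in zip(coeffs, velocity)]

def igd_step(coeffs, velocity, row_gradient, lr, momentum):
    # igd path: one row's gradient at a time, so no total_gradient_per_layer
    # accumulator has to be created or updated.
    velocity = momentum_update(velocity, row_gradient, lr, momentum)
    return model_update(coeffs, velocity), velocity

def minibatch_step(coeffs, velocity, batch_gradients, lr, momentum):
    # minibatch path: average the per-row gradients of one batch, then call
    # the same two update helpers.
    avg_gradient = [np.mean(layer, axis=0) for layer in zip(*batch_gradients)]
    velocity = momentum_update(velocity, avg_gradient, lr, momentum)
    return model_update(coeffs, velocity), velocity
{code}
The point of the split is that the igd path never has to build a
total_gradient_per_layer accumulator, while the mini-batch path still averages
its batch gradient before calling the same two helpers.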
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib feature/mlp_momentum
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/272.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #272
----
commit 176e197f48732443ce658c5d02cefc8c45e7ff52
Author: Rahul Iyer <riyer@...>
Date: 2018-05-02T12:25:48Z
MLP: Add momentum and nesterov to gradient updates.
JIRA: MADLIB-1210
We refactored the minibatch code to separate out the momentum and model
update functions. Initially we used the same function to compute the loss
and gradient for both igd and minibatch, but the overhead of creating and
updating the total_gradient_per_layer variable made igd slower. We therefore
stopped sharing that code and now call the model and momentum update
functions separately for both igd and minibatch.
Co-authored-by: Rahul Iyer <[email protected]>
Co-authored-by: Jingyi Mei <[email protected]>
----
> Add momentum methods to MLP
> ---------------------------
>
> Key: MADLIB-1210
> URL: https://issues.apache.org/jira/browse/MADLIB-1210
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Neural Networks
> Reporter: Frank McQuillan
> Priority: Major
> Fix For: v1.15
>
> Attachments: Momentum methods comparison.xlsx
>
>
> Story
> As a data scientist,
> I want to use momentum methods in MLP,
> so that I get significantly better convergence behavior.
> Details
> Adding momentum will bring the MADlib MLP algorithm closer to the state of the art.
> 1) Implement momentum term, default value ~0.9
> Ref [1]:
> "Momentum update is another approach that almost always enjoys better
> converge rates on deep networks."
> 2) Implement Nesterov momentum, default TRUE
> Ref [1]:
> "Nesterov Momentum is a slightly different version of the momentum update
> that has recently been gaining popularity. It enjoys stronger theoretical
> converge guarantees for convex functions and in practice it also consistently
> works slightly better than standard momentum."
> Ref [2]
> "Nesterov’s accelerated gradient (abbrv. NAG; Nesterov, 1983) is a
> first-order optimization method which is proven to have a better convergence
> rate guarantee than gradient descent for general convex functions with
> Lipschitz-continuous derivatives (O(1/T^2) versus O(1/T))."
> Interface
> There are 2 new optimization params for momentum, which apply to both
> classification and regression:
> {code}
> 'learning_rate_init = <value>,
> learning_rate_policy = <value>,
> gamma = <value>,
> power = <value>,
> iterations_per_step = <value>,
> n_iterations = <value>,
> n_tries = <value>,
> lambda = <value>,
> tolerance = <value>,
> batch_size = <value>,
> n_epochs = <value>,
> momentum = <value>,
> nesterov_momentum = <value>'
> momentum
> Default: 0.9. Momentum can help accelerate learning and
> avoid local minima when using gradient descent. Value must be in the
> range 0 to 1, where 0 means no momentum.
> nesterov_momentum
> Default: TRUE. Nesterov momentum can provide better results than using
> classical momentum alone, due to its look-ahead characteristics.
> In classical momentum you first correct the velocity and then step with that
> velocity, whereas in Nesterov momentum you first step in the velocity
> direction and then correct the velocity vector based on the new location
> (see the sketch after this code block).
> Nesterov momentum is only used when the 'momentum' parameter is > 0.
> {code}
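> To make the classical vs. Nesterov difference concrete, here is a hedged
> sketch in plain Python with a toy gradient (an illustration of the two update
> rules, not MADlib's implementation):
> {code}
> import numpy as np
>
> momentum, learning_rate = 0.9, 0.01
> weights, velocity = np.ones(2), np.zeros(2)
> grad = lambda w: w  # toy gradient of f(w) = ||w||^2 / 2
>
> # Classical momentum: correct the velocity with the gradient at the current
> # position, then step with that velocity.
> velocity = momentum * velocity - learning_rate * grad(weights)
> weights = weights + velocity
>
> # Nesterov momentum: step along the current velocity first (look-ahead), then
> # correct the velocity with the gradient evaluated at the look-ahead point.
> velocity = momentum * velocity - learning_rate * grad(weights + momentum * velocity)
> weights = weights + velocity
> {code}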
> Open questions
> 1) Do momentum and Nesterov momentum work equally well with and without
> mini-batching?
> Is there any guidance we need to give to users on this?
> Acceptance
> [1] Compare the usefulness of momentum, with and without Nesterov, for both
> mini-batch and SGD. Use a 2D Rosenbrock function and compare in a similar way
> to test ref [100] in the comment further down, i.e., loss by iteration number
> (a sketch of such a comparison follows this list).
> [2] Use another well-behaved function (TBD) and run tests similar to those in
> [1] above.
> [3] Test with MNIST.
> [4] Test with CIFAR-10 or CIFAR-100
> http://www.cs.toronto.edu/~kriz/cifar.html
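> For item [1] above, a rough sketch of the intended comparison (a standalone
> Python script, not a MADlib test; the learning rate, iteration count, and
> starting point are arbitrary illustrative choices):
> {code}
> import numpy as np
>
> def rosenbrock(w, a=1.0, b=100.0):
>     x, y = w
>     return (a - x) ** 2 + b * (y - x ** 2) ** 2
>
> def rosenbrock_grad(w, a=1.0, b=100.0):
>     x, y = w
>     return np.array([-2 * (a - x) - 4 * b * x * (y - x ** 2),
>                      2 * b * (y - x ** 2)])
>
> def run(n_iter=5000, lr=5e-4, momentum=0.9, nesterov=False):
>     w, v = np.array([-1.5, 1.5]), np.zeros(2)
>     losses = []
>     for _ in range(n_iter):
>         g = rosenbrock_grad(w + momentum * v) if nesterov else rosenbrock_grad(w)
>         v = momentum * v - lr * g
>         w = w + v
>         losses.append(rosenbrock(w))  # loss by iteration number
>     return losses
>
> for label, kwargs in [('no momentum', dict(momentum=0.0)),
>                       ('classical momentum', dict(momentum=0.9)),
>                       ('nesterov momentum', dict(momentum=0.9, nesterov=True))]:
>     print(label, run(**kwargs)[-1])
> {code}
> Plotting each returned loss series against iteration number gives the
> loss-by-iteration comparison described above.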
> References
> [1] http://cs231n.github.io/neural-networks-3/#sgd
> [2] http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf, a
> link from the previous source.
> [3] http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms
> [4] http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
> [5] https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)