[ https://issues.apache.org/jira/browse/MADLIB-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468098#comment-16468098 ]

Frank McQuillan edited comment on MADLIB-1210 at 5/9/18 12:56 AM:
------------------------------------------------------------------

For testing, I'd suggest we start with the 2D Rosenbrock function, since it is 
well understood and useful for comparing optimization methods.  There are 
existing runs, as in ref [100] below, that provide useful guidance.

When we get these results, we can decide on the next function to test.

It is better to start with a well-behaved function rather than a real-world 
data set, which may have noise and other characteristics that make an 
apples-to-apples comparison a challenge.

From ref [103] below:
"Simple functions like Rosenbrock's are used to debug and pre-test newly 
written algorithms: They are fast to implement and to execute, and a method 
that cannot solve the standard problems well is unlikely to work well on real 
life problems."
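
To make the starting point concrete, here is a minimal sketch (plain 
Python/NumPy, outside MADlib) of the 2D Rosenbrock function and a small 
gradient-descent loop with classical and Nesterov momentum.  The learning 
rate, momentum value, iteration count, and starting point are illustrative 
assumptions, not tuned settings; ref [100] walks through more careful 
experiments.

{code}
import numpy as np

# Rosenbrock: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2, global minimum at (1, 1).
def rosenbrock(p):
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    x, y = p
    dx = -2.0 * (1.0 - x) - 400.0 * x * (y - x ** 2)
    dy = 200.0 * (y - x ** 2)
    return np.array([dx, dy])

def descend(p0, lr=1e-4, mu=0.9, nesterov=True, iters=10000):
    """Gradient descent with classical or Nesterov momentum (untuned, illustrative)."""
    p, v = np.array(p0, dtype=float), np.zeros(2)
    for _ in range(iters):
        # Nesterov evaluates the gradient at the look-ahead point p + mu*v.
        g = rosenbrock_grad(p + mu * v) if nesterov else rosenbrock_grad(p)
        v = mu * v - lr * g
        p = p + v
    return p, rosenbrock(p)

# Classic starting point (-1.2, 1); compare the two momentum variants.
for nag in (False, True):
    p, f = descend([-1.2, 1.0], nesterov=nag)
    print("nesterov=%s  p=%s  f=%.3e" % (nag, np.round(p, 4), f))
{code}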

Test references

[100]
Nesterov Accelerated Gradient and Momentum
https://jlmelville.github.io/mize/nesterov.html

See in particular the section on testing with the 2D Rosenbrock function:
https://jlmelville.github.io/mize/nesterov.html#testing_with_rosenbrock

[101] 
Rosenbrock function
https://en.wikipedia.org/wiki/Rosenbrock_function

[102]
Comparison of derivative-free optimization algorithms
http://archimedes.cheme.cmu.edu/?q=dfocomp

[103]
Testing numerical optimization methods: Rosenbrock vs. real test functions
https://scicomp.stackexchange.com/questions/3029/testing-numerical-optimization-methods-rosenbrock-vs-real-test-functions/19888

[104]
Unit Tests for Stochastic Optimization
https://arxiv.org/pdf/1312.6055.pdf




was (Author: fmcquillan):
If anyone knows of a good data set to demonstrate the virtues of momentum and 
Nesterov momentum for MLP with gradient descent, please add link(s) to this 
JIRA.

> Add momentum methods to MLP
> ---------------------------
>
>                 Key: MADLIB-1210
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1210
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Neural Networks
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.15
>
>
> Story
> As a data scientist,
> I want to use momentum methods in MLP,
> so that I get significantly better convergence behavior.
> Details
> Adding momentum will get the MADlib MLP algorithm closer to state of the art.
> 1) Implement momentum term, default value ~0.9
> Ref [1]:
> "Momentum update is another approach that almost always enjoys better 
> converge rates on deep networks." 
> 2) Implement Nesterov momentum, default TRUE
> Ref [1]:
> "Nesterov Momentum is a slightly different version of the momentum update 
> that has recently been gaining popularity. It enjoys stronger theoretical 
> converge guarantees for convex functions and in practice it also consistently 
> works slightly better than standard momentum."
> Ref [2]
> "Nesterov’s accelerated gradient (abbrv. NAG; Nesterov, 1983) is a 
> first-order optimization method which is proven to have a better convergence 
> rate guarantee than gradient descent for general convex functions with 
> Lipshitz-continuous derivatives (O(1/T2) versus O(1/T))"
> Interface
> There are two new optimization params for momentum, which apply to both 
> classification and regression:
> {code}
> 'learning_rate_init = <value>,
> learning_rate_policy = <value>,
> gamma = <value>,
> power = <value>,
> iterations_per_step = <value>,
> n_iterations = <value>,
> n_tries = <value>,
> lambda = <value>,
> tolerance = <value>,
> batch_size = <value>,
> n_epochs = <value>,
> momentum = <value>,
> nesterov_momentum = <value>'
> {code}
> momentum
> Default: 0.9. Momentum can help accelerate learning and 
> avoid local minima when using gradient descent. Value must be in the 
> range 0 to 1, where 0 means no momentum.
> nesterov_momentum
> Default: TRUE. Nesterov momentum can provide better results than using 
> classical momentum alone, due to its look-ahead characteristics.  
> In classical momentum you first correct the velocity and step with that 
> velocity, whereas in Nesterov momentum you first step in the velocity 
> direction and then make a correction to the velocity vector based on the 
> new location.
> Nesterov momentum is only used when the 'momentum' parameter is > 0.
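> 
> As a point of reference, the two update rules described above can be sketched 
> roughly as follows (plain Python with illustrative names; this is a sketch of 
> the general technique, not the MADlib implementation):
> {code}
> def classical_momentum_step(w, v, grad, lr, mu):
>     # Correct the velocity using the gradient at the current point,
>     # then step with that velocity.
>     v = mu * v - lr * grad(w)
>     return w + v, v
> 
> def nesterov_momentum_step(w, v, grad, lr, mu):
>     # Step ahead in the velocity direction first, then correct the velocity
>     # using the gradient evaluated at that look-ahead point.
>     v = mu * v - lr * grad(w + mu * v)
>     return w + v, v
> {code}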
> Open questions
> 1) Do momentum and Nesterov momentum work equally well with and without 
> mini-batching?
> Is there any guidance we need to give to users on this?
> Acceptance
> [1] Find/create a dataset that can be used to compare the usefulness of 
> momentum with and without Nesterov, mini-batching, and SGD. Usefulness can 
> be compared based on convergence, which includes both speed and accuracy.
> References
> [1] http://cs231n.github.io/neural-networks-3/#sgd
> [2] http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf, a 
> link from previous source.
> [3] 
> http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms
> [4] 
> http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
> [5] https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
