[ 
https://issues.apache.org/jira/browse/MADLIB-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1210:
------------------------------------
    Description: 
Story

As a data scientist,
I want to use momentum methods in MLP,
so that I get significantly better convergence behavior.

Details

Adding momentum will bring the MADlib MLP algorithm closer to the state of the art.

1) Implement a momentum term, default value ~0.9

Ref [1]:
"Momentum update is another approach that almost always enjoys better converge 
rates on deep networks." 

2) Implement Nesterov momentum, default TRUE

Ref [1]:
"Nesterov Momentum is a slightly different version of the momentum update that 
has recently been gaining popularity. It enjoys stronger theoretical converge 
guarantees for convex functions and in practice it also consistently works 
slightly better than standard momentum."

Ref [2]:
"Nesterov’s accelerated gradient (abbrv. NAG; Nesterov, 1983) is a first-order 
optimization method which is proven to have a better convergence rate guarantee 
than gradient descent for general convex functions with Lipschitz-continuous 
derivatives (O(1/T^2) versus O(1/T))"
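
For reference, the update rules that items 1) and 2) refer to, in the standard 
formulation of [1] and [3] (a sketch, not the final MADlib implementation; 
\mu is the momentum coefficient and \eta the learning rate):

{code}
Classical momentum:
  v_{t+1}      = \mu v_t - \eta \nabla f(\theta_t)
  \theta_{t+1} = \theta_t + v_{t+1}

Nesterov momentum (gradient evaluated at the look-ahead point \theta_t + \mu v_t):
  v_{t+1}      = \mu v_t - \eta \nabla f(\theta_t + \mu v_t)
  \theta_{t+1} = \theta_t + v_{t+1}
{code}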

Interface

There are 2 new optimization parameters for momentum, which apply to both 
classification and regression:

{code}
'learning_rate_init = <value>,
learning_rate_policy = <value>,
gamma = <value>,
power = <value>,
iterations_per_step = <value>,
n_iterations = <value>,
n_tries = <value>,
lambda = <value>,
tolerance = <value>,
batch_size = <value>,
n_epochs = <value>,
momentum = <value>,
nesterov_momentum = <value>'

momentum
Default: 0.9. Momentum can help accelerate learning and 
avoid local minima when using gradient descent. Value must be in the 
range 0 to 1, where 0 means no momentum.

nesterov_momentum
Default: TRUE. Nesterov momentum can provide better results than using
classical momentum alone, due to its look-ahead characteristics.
In classical momentum you first correct the velocity and then step with that
velocity, whereas in Nesterov momentum you first step in the velocity
direction and then make a correction to the velocity vector based on the
new location.

Nesterov momentum is only used when the 'momentum' parameter is > 0.
{code}
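
As a sketch of how the new parameters would be passed (this assumes the existing 
mlp_classification() argument order and a hypothetical iris_data table; not the 
final syntax):

{code}
SELECT madlib.mlp_classification(
    'iris_data',        -- source table (hypothetical example data)
    'mlp_model',        -- output model table
    'attributes',       -- independent variables
    'class_text',       -- dependent variable
    ARRAY[5],           -- hidden layer sizes
    'learning_rate_init=0.003, n_iterations=500, tolerance=0,
     momentum=0.9, nesterov_momentum=TRUE',   -- optimizer params, incl. the new momentum settings
    'tanh',             -- activation
    NULL,               -- weights
    FALSE,              -- warm start
    FALSE               -- verbose
);
{code}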

Open questions

1) Do momentum and Nesterov momentum work equally well with and without 
mini-batching?
Is there any guidance we need to give users on this?

Acceptance

1) Use the 2D Rosenbrock function to visually check convergence behavior with 
and without momentum (see the definition below).
2) Find/create a dataset that can be used to compare the usefulness of 
momentum with and without Nesterov, mini-batch, and SGD. Usefulness is 
measured by convergence behavior, which includes both speed and accuracy.
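
For item 1), the 2D Rosenbrock function is the usual choice of test surface 
(with the common constants a = 1, b = 100):

{code}
f(x, y) = (a - x)^2 + b (y - x^2)^2,   global minimum at (x, y) = (a, a^2) = (1, 1)
{code}

Its long, narrow curved valley makes it a convenient visual check that momentum 
and Nesterov momentum converge faster than plain gradient descent.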

References

[1] http://cs231n.github.io/neural-networks-3/#sgd
[2] http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf, a link 
from the previous source.
[3] 
http://ruder.io/optimizing-gradient-descent/index.html#gradientdescentoptimizationalgorithms
[4] 
http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
[5] https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf


> Add momentum methods to MLP
> ---------------------------
>
>                 Key: MADLIB-1210
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1210
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Neural Networks
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v1.15
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
