[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350812#comment-14350812 ]

Aaron Staple commented on SPARK-1503:
-------------------------------------

I recently created a PR for an implementation of accelerated gradient descent 
without backtracking, as discussed above.

I also ran some simple, small-scale (single-server) benchmarks comparing 
different optimization methods. (The benchmark result graphs are attached to 
this ticket as images.) The optimization methods tested were:

- gra: the existing gradient descent implementation (using the full batch)
- acc: accelerated descent (as implemented in the PR), without automatic 
restart
- acc_r: accelerated descent, with automatic restart (the accelerated update 
with restart is sketched below)
- acc_b: accelerated descent, with backtracking, without automatic restart
- acc_rb: accelerated descent, with backtracking and automatic restart
- lbfgs: the existing L-BFGS implementation

(Note that backtracking is not part of the PR; it was implemented only for 
these benchmarks.)
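
For reference, here is a minimal single-machine sketch of the accelerated 
update with a gradient-based adaptive restart test. It is only an 
illustration of the technique, not the PR's RDD-based code, and the names and 
parameters below are placeholders:

    import scala.math.sqrt

    // Minimal sketch of a Nesterov-style accelerated gradient loop with an
    // optional gradient-based adaptive restart test. Single-machine
    // illustration on Array[Double] weights; not the PR's implementation.
    object AcceleratedSketch {
      def minimize(
          gradient: Array[Double] => Array[Double], // gradient of the objective
          init: Array[Double],                      // initial weights
          stepSize: Double,                         // fixed step size (no backtracking)
          iterations: Int,                          // outer-loop iterations
          restart: Boolean                          // enable adaptive restart
      ): Array[Double] = {
        val n = init.length
        var x = init.clone()   // current iterate x_k
        var y = init.clone()   // extrapolated point y_k
        var t = 1.0            // momentum sequence t_k
        for (_ <- 1 to iterations) {
          val g = gradient(y)
          // Gradient step taken from the extrapolated point y.
          val xNext = Array.tabulate(n)(i => y(i) - stepSize * g(i))
          // Restart test: drop the momentum whenever the step direction has a
          // positive inner product with the gradient at y.
          val keepMomentum = !restart ||
            (0 until n).map(i => g(i) * (xNext(i) - x(i))).sum <= 0.0
          val tNext =
            if (keepMomentum) (1.0 + sqrt(1.0 + 4.0 * t * t)) / 2.0 else 1.0
          val beta = if (keepMomentum) (t - 1.0) / tNext else 0.0
          // Extrapolate: y_{k+1} = x_{k+1} + beta * (x_{k+1} - x_k).
          y = Array.tabulate(n)(i => xNext(i) + beta * (xNext(i) - x(i)))
          x = xNext
          t = tNext
        }
        x
      }
    }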

The x axis shows the number of outer-loop iterations of the optimization 
algorithm. (Note that for the backtracking implementations, the full cost of 
backtracking is not reflected in this outer-loop count. For the 
non-backtracking implementations, the number of outer-loop iterations equals 
the number of spark map reduce jobs.) The y axis shows the log of the 
difference between the objective value at each iteration and the best 
objective value found.
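
Concretely, the plotted y value per iteration is just the following (a 
hypothetical helper; 'lossHistory' and 'bestLoss' are illustrative names, 
with bestLoss taken as the smallest objective value found by any method):

    // Given the objective value recorded after each outer-loop iteration and
    // the best objective value found overall, return log(f(x_k) - f_best).
    def logSuboptimality(lossHistory: Array[Double], bestLoss: Double): Array[Double] =
      lossHistory.map(loss => math.log(loss - bestLoss))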

The optimization test runs were:

- linear: A scaled-up version of the test data from TFOCS’s test_LASSO.m 
example, with 10000 observations on 1024 features, 512 of which are actually 
correlated with the result. Unregularized linear regression was used. (The 
Scala acceleration implementation was observed to be consistent with the 
TFOCS implementation on this dataset.)
- linear l1: The same as ‘linear’, but with l1 regularization.
- logistic: Each feature of each observation is generated by summing a 
Gaussian specific to that feature and the observation’s binary category with 
a noise Gaussian (see the data-generation sketch below). 10000 observations 
on 250 features. Unregularized logistic regression was used.
- logistic l2: The same as ‘logistic’, but with l2 regularization.

Note that for each test run, all optimization methods were given the same 
initial step size.
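
For clarity, here is a rough sketch of how the ‘logistic’ data description 
above can be realized; the specific means, variances, and seed are 
placeholders rather than the actual benchmark parameters:

    import scala.util.Random

    // Illustrative generator for the 'logistic' benchmark data as described
    // above: each feature value is a draw from a Gaussian specific to the
    // (class, feature) pair plus an independent Gaussian noise draw. The
    // dimensions follow the description (10000 x 250); everything else is a
    // placeholder.
    object LogisticDataSketch {
      def generate(numObservations: Int = 10000,
                   numFeatures: Int = 250,
                   seed: Long = 42L): Seq[(Double, Array[Double])] = {
        val rng = new Random(seed)
        // One Gaussian mean per (binary class, feature) pair.
        val classMeans = Array.fill(2, numFeatures)(rng.nextGaussian())
        Seq.fill(numObservations) {
          val label = if (rng.nextBoolean()) 1.0 else 0.0
          val features = Array.tabulate(numFeatures) { j =>
            // class/feature-specific Gaussian draw + noise Gaussian draw
            (classMeans(label.toInt)(j) + rng.nextGaussian()) + rng.nextGaussian()
          }
          (label, features)
        }
      }
    }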

Observations:
- Acceleration consistently converges more quickly than standard gradient 
descent, given the same initial step size.
- Automatic restart helps the accelerated method converge.
- Backtracking can significantly boost convergence rates in some cases 
(measured in outer-loop iterations), but the full cost of backtracking was 
not measured in these runs.
- lbfgs generally outperformed accelerated gradient descent in these test 
runs. Accelerated gradient descent was competitive with lbfgs for linear l1 
(lasso) regression. However, as noted in the documentation, the L1Updater 
“will not work” with the lbfgs implementation. It seems l1 regularization is 
currently a weak spot for the lbfgs implementation. (The relevant entry 
points are sketched below.)
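
For anyone wanting to reproduce a comparison like this, the existing MLlib 
developer-API entry points involved look roughly as follows (Spark 1.x; the 
numeric parameter values are illustrative, not the benchmark settings):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.optimization.{GradientDescent, L1Updater,
      LBFGS, LeastSquaresGradient, SquaredL2Updater}
    import org.apache.spark.rdd.RDD

    // Sketch of running the existing full-batch gradient descent and L-BFGS
    // optimizers on the same data, as in the 'linear l1' comparison.
    object OptimizerComparisonSketch {
      def runBoth(data: RDD[(Double, Vector)], initialWeights: Vector): Unit = {
        // Full-batch gradient descent ('gra') with l1 regularization;
        // miniBatchFraction = 1.0 makes each iteration a full pass.
        val (gdWeights, gdLossHistory) = GradientDescent.runMiniBatchSGD(
          data,
          new LeastSquaresGradient(),
          new L1Updater(),
          1.0,    // stepSize (all methods were given the same initial step size)
          100,    // numIterations (outer-loop iterations)
          0.1,    // regParam
          1.0,    // miniBatchFraction: full batch
          initialWeights)

        // L-BFGS ('lbfgs'). Per the MLlib docs the L1Updater does not work
        // here, so an l2 updater is used instead.
        val (lbfgsWeights, lbfgsLossHistory) = LBFGS.runLBFGS(
          data,
          new LeastSquaresGradient(),
          new SquaredL2Updater(),
          10,     // numCorrections
          1e-6,   // convergenceTol
          100,    // maxNumIterations
          0.1,    // regParam
          initialWeights)
      }
    }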

> Implement Nesterov's accelerated first-order method
> ---------------------------------------------------
>
>                 Key: SPARK-1503
>                 URL: https://issues.apache.org/jira/browse/SPARK-1503
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for 
> steepest descent but it converges much faster. We should implement this 
> method and compare its performance with existing algorithms, including SGD 
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
> method and its variants on composite objectives.


